Help on package h2o:
NAME
h2o - :mod:`h2o` -- module for using H2O services.
PACKAGE CONTENTS
assembly
astfun
auth
automl (package)
backend (package)
base
cross_validation
demos
display
estimators (package)
exceptions
explanation (package)
expr
expr_optimizer
frame
grid (package)
group_by
h2o
information_retrieval (package)
job
model (package)
persist (package)
pipeline (package)
schemas (package)
sklearn (package)
transforms (package)
tree (package)
two_dim_table
utils (package)
FUNCTIONS
api(endpoint, data=None, json=None, filename=None, save_to=None)
Perform a REST API request to a previously connected server.
This function is mostly for internal purposes, but may occasionally be useful for direct access to
the backend H2O server. It has the same parameters as :meth:`H2OConnection.request <h2o.backend.H2OConnection.request>`.
:examples:
>>> res = h2o.api("GET /3/NetworkTest")
>>> res["table"].show()
as_list(data, use_pandas=True, header=True)
Convert an H2O data object into a python-specific object.
WARNING! This will pull all data local!
If Pandas is available (and use_pandas is True), then pandas will be used to parse the
data frame. Otherwise, a list-of-lists populated by character data will be returned (so
the types of data will all be str).
:param data: an H2O data object.
:param use_pandas: If True, try to use pandas for reading in the data.
:param header: If True, return column names as the first element in the list.
:returns: List of lists (Rows x Columns).
:examples:
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> from h2o.utils.typechecks import assert_is_type
>>> res1 = h2o.as_list(iris, use_pandas=False)
>>> assert_is_type(res1, list)
>>> res1 = list(zip(*res1))
>>> assert abs(float(res1[0][9]) - 4.4) < 1e-10 and \
...     abs(float(res1[1][9]) - 2.9) < 1e-10 and \
...     abs(float(res1[2][9]) - 1.4) < 1e-10, "incorrect values"
>>> res1
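With pandas installed, the default ``use_pandas=True`` path returns a ``pandas.DataFrame`` instead of a list of lists (a minimal sketch, assuming pandas is available in the session and ``iris`` was imported as above):
>>> import pandas
>>> res2 = h2o.as_list(iris, use_pandas=True)
>>> assert isinstance(res2, pandas.DataFrame)
>>> res2.head()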
assign(data, xid)
(internal) Assign new id to the frame.
:param data: an H2OFrame whose id should be changed
:param xid: new id for the frame.
:returns: the passed frame.
:examples:
>>> old_name = "prostate.csv"
>>> new_name = "newProstate.csv"
>>> training_data = h2o.import_file(("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip"),
... destination_frame=old_name)
>>> temp=h2o.assign(training_data, new_name)
cluster()
Return :class:`H2OCluster` object describing the backend H2O cluster.
:examples:
>>> import h2o
>>> h2o.init()
>>> h2o.cluster()
cluster_info()
Deprecated, use ``h2o.cluster().show_status()``.
cluster_status()
Deprecated, use ``h2o.cluster().show_status(True)``.
connect(server=None, url=None, ip=None, port=None, https=None, verify_ssl_certificates=None, cacert=None, auth=None, proxy=None, cookies=None, verbose=True, config=None, strict_version_check=False)
Connect to an existing H2O server, remote or local.
There are two ways to connect to a server: either pass a `server` parameter containing an instance of
an H2OLocalServer, or specify `ip` and `port` of the server that you want to connect to.
:param server: An H2OLocalServer instance to connect to (optional).
:param url: Full URL of the server to connect to (can be used instead of `ip` + `port` + `https`).
:param ip: The ip address (or host name) of the server where H2O is running.
:param port: Port number that H2O service is listening to.
:param https: Set to True to connect via https:// instead of http://.
:param verify_ssl_certificates: When using https, setting this to False will disable SSL certificates verification.
:param cacert: Path to a CA bundle file or a directory with certificates of trusted CAs (optional).
:param auth: Either a (username, password) pair for basic authentication, an instance of h2o.auth.SpnegoAuth
or one of the requests.auth authenticator objects.
:param proxy: Proxy server address.
:param cookies: Cookie (or list of cookies) to add to each request.
:param verbose: Set to False to disable printing connection status messages.
:param config: Connection configuration object encapsulating connection parameters.
:param strict_version_check: If True, an error will be raised if the client and server versions don't match.
:returns: the new :class:`H2OConnection` object.
:examples:
>>> import h2o
>>> ipA = "127.0.0.1"
>>> portN = "54321"
>>> urlS = "http://127.0.0.1:54321"
>>> connect_type = h2o.connect(ip=ipA, port=portN, verbose=True)
# or
>>> connect_type2 = h2o.connect(url=urlS, https=True, verbose=True)
connection()
Return the current :class:`H2OConnection` handler.
:examples:
>>> temp = h2o.connection()
>>> temp
create_frame(frame_id=None, rows=10000, cols=10, randomize=True, real_fraction=None, categorical_fraction=None, integer_fraction=None, binary_fraction=None, time_fraction=None, string_fraction=None, value=0, real_range=100, factors=100, integer_range=100, binary_ones_fraction=0.02, missing_fraction=0.01, has_response=False, response_factors=2, positive_response=False, seed=None, seed_for_column_types=None)
Create a new frame with random data.
Creates a data frame in H2O with real-valued, categorical, integer, and binary columns specified by the user.
:param frame_id: the destination key. If empty, this will be auto-generated.
:param rows: the number of rows of data to generate.
:param cols: the number of columns of data to generate. Excludes the response column if has_response is True.
:param randomize: If True, data values will be randomly generated. This must be True if either
categorical_fraction or integer_fraction is non-zero.
:param value: if randomize is False, then all real-valued entries will be set to this value.
:param real_range: the range of randomly generated real values.
:param real_fraction: the fraction of columns that are real-valued.
:param categorical_fraction: the fraction of total columns that are categorical.
:param factors: the number of (unique) factor levels in each categorical column.
:param integer_fraction: the fraction of total columns that are integer-valued.
:param integer_range: the range of randomly generated integer values.
:param binary_fraction: the fraction of total columns that are binary-valued.
:param binary_ones_fraction: the fraction of values in a binary column that are set to 1.
:param time_fraction: the fraction of randomly created date/time columns.
:param string_fraction: the fraction of randomly created string columns.
:param missing_fraction: the fraction of total entries in the data frame that are set to NA.
:param has_response: A logical value indicating whether an additional response column should be prepended to the
final H2O data frame. If set to True, the total number of columns will be ``cols + 1``.
:param response_factors: if has_response is True, then this variable controls the type of the "response" column:
setting response_factors to 1 will generate real-valued response, any value greater or equal than 2 will
create categorical response with that many categories.
:param positive_response: when the response variable is present and of real type, this will control whether it
contains positive values only, or both positive and negative.
:param seed: a seed used to generate random values when ``randomize`` is True.
:param seed_for_column_types: a seed used to generate random column types when ``randomize`` is True.
:returns: an :class:`H2OFrame` object
:examples:
>>> import random
>>> dataset_params = {}
>>> dataset_params['rows'] = random.sample(list(range(50,150)),1)[0]
>>> dataset_params['cols'] = random.sample(list(range(3,6)),1)[0]
>>> dataset_params['categorical_fraction'] = round(random.random(),1)
>>> left_over = (1 - dataset_params['categorical_fraction'])
>>> dataset_params['integer_fraction'] = round(left_over -
...     round(random.uniform(0, left_over), 1), 1)
>>> if dataset_params['integer_fraction'] + dataset_params['categorical_fraction'] == 1:
...     if dataset_params['integer_fraction'] > dataset_params['categorical_fraction']:
...         dataset_params['integer_fraction'] = dataset_params['integer_fraction'] - 0.1
...     else:
...         dataset_params['categorical_fraction'] = dataset_params['categorical_fraction'] - 0.1
>>> dataset_params['missing_fraction'] = random.uniform(0,0.5)
>>> dataset_params['has_response'] = False
>>> dataset_params['randomize'] = True
>>> dataset_params['factors'] = random.randint(2,5)
>>> print("Dataset parameters: {0}".format(dataset_params))
>>> distribution = random.sample(['bernoulli','multinomial',
... 'gaussian','poisson','gamma'], 1)[0]
>>> if distribution == 'bernoulli':
...     dataset_params['response_factors'] = 2
... elif distribution == 'gaussian':
...     dataset_params['response_factors'] = 1
... elif distribution == 'multinomial':
...     dataset_params['response_factors'] = random.randint(3,5)
... else:
...     dataset_params['has_response'] = False
>>> print("Distribution: {0}".format(distribution))
>>> train = h2o.create_frame(**dataset_params)
deep_copy(data, xid)
Create a deep clone of the frame ``data``.
:param data: an H2OFrame to be cloned
:param xid: (internal) id to be assigned to the new frame.
:returns: new :class:`H2OFrame` which is the clone of the passed frame.
:examples:
>>> training_data = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> new_name = "new_frame"
>>> training_copy = h2o.deep_copy(training_data, new_name)
>>> training_copy
demo(funcname, interactive=True, echo=True, test=False)
H2O built-in demo facility.
:param funcname: A string that identifies the h2o python function to demonstrate.
:param interactive: If True, the user will be prompted to continue the demonstration after every segment.
:param echo: If True, the python commands that are executed will be displayed.
:param test: If True, `h2o.init()` will not be called (used for pyunit testing).
:example:
>>> import h2o
>>> h2o.demo("gbm")
download_all_logs(dirname='.', filename=None, container=None)
Download H2O log files to disk.
:param dirname: a character string indicating the directory that the log file should be saved in.
:param filename: a string indicating the name that the saved log file should have.
Note that the default container format is .zip, so the file name must include the .zip extension.
:param container: a string indicating how to archive the logs, choice of "ZIP" (default) and "LOG"
ZIP: individual log files archived in a ZIP package
LOG: all log files will be concatenated together in one text file
:returns: path of logs written in a zip file.
:examples: The following code saves the zip file `'h2o_log.zip'` into a subdirectory of the current working directory called `your_directory_name`. (Note that `your_directory_name` should be replaced with the name of a directory that you have already created.)
>>> h2o.download_all_logs(dirname='./your_directory_name/', filename = 'h2o_log.zip')
download_csv(data, filename)
Download an H2O data set to a CSV file on the local disk.
Warning: Files located on the H2O server may be very large! Make sure you have enough
hard drive space to accommodate the entire file.
:param data: an H2OFrame object to be downloaded.
:param filename: name for the CSV file where the data should be saved to.
:examples:
>>> iris = h2o.load_dataset("iris")
>>> h2o.download_csv(iris, "iris_delete.csv")
>>> iris2 = h2o.import_file("iris_delete.csv")
download_model(model, path='', export_cross_validation_predictions=False, filename=None)
Download an H2O Model object to the machine this python session is currently connected to.
The saved file is owned by the user under whom the python session runs.
:param model: The model object to download.
:param path: a path to the directory where the model should be saved.
:param export_cross_validation_predictions: logical, indicates whether the exported model
artifact should also include CV Holdout Frame predictions. Default is not to include the predictions.
:param filename: a filename for the saved model
:returns: the path of the downloaded model
:examples:
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> h2o_df = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> my_model = H2OGeneralizedLinearEstimator(family = "binomial")
>>> my_model.train(y = "CAPSULE",
... x = ["AGE", "RACE", "PSA", "GLEASON"],
... training_frame = h2o_df)
>>> h2o.download_model(my_model, path='')
download_pojo(model, path='', get_jar=True, jar_name='')
Download the POJO for this model to the directory specified by path; if path is "", then dump to screen.
:param model: the model whose scoring POJO should be retrieved.
:param path: an absolute path to the directory where POJO should be saved.
:param get_jar: retrieve the h2o-genmodel.jar also (will be saved to the same folder ``path``).
:param jar_name: Custom name of genmodel jar.
:returns: location of the downloaded POJO file.
:examples:
>>> h2o_df = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> h2o_df['CAPSULE'] = h2o_df['CAPSULE'].asfactor()
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> binomial_fit = H2OGeneralizedLinearEstimator(family = "binomial")
>>> binomial_fit.train(y = "CAPSULE",
... x = ["AGE", "RACE", "PSA", "GLEASON"],
... training_frame = h2o_df)
>>> h2o.download_pojo(binomial_fit, path='', get_jar=False)
enable_expr_optimizations(flag)
Enable expression tree local optimizations.
:examples:
>>> h2o.enable_expr_optimizations(True)
estimate_cluster_mem(ncols, nrows, num_cols=0, string_cols=0, cat_cols=0, time_cols=0, uuid_cols=0)
Computes an estimate for cluster memory usage in GB.
Number of columns and number of rows are required. For a better estimate you can provide counts of the different
types of columns in the dataset.
:param ncols: total number of columns in a dataset. A required parameter; integer, can't be negative.
:param nrows: total number of rows in a dataset. A required parameter; integer, can't be negative.
:param num_cols: number of numeric columns in a dataset. Integer, can't be negative.
:param string_cols: number of string columns in a dataset. Integer, can't be negative.
:param cat_cols: number of categorical columns in a dataset. Integer, can't be negative.
:param time_cols: number of time columns in a dataset. Integer, can't be negative.
:param uuid_cols: number of uuid columns in a dataset. Integer, can't be negative.
:return: A memory estimate in GB.
:example:
>>> from h2o import estimate_cluster_mem
>>> ### Load a Parquet file with 18 columns and 2 million rows
>>> estimate_cluster_mem(18, 2000000)
>>> ### Load another Parquet file with 16 columns and 2 million rows; ask for a more precise estimate
>>> ### because 12 of the 16 columns are known to be categorical and one of the 16 columns contains UUIDs.
>>> estimate_cluster_mem(16, 2000000, cat_cols=12, uuid_cols=1)
>>> ### Load a Parquet file with 8 columns and 31 million rows; ask for a more precise estimate
>>> ### because 4 of the 8 columns are known to be categorical and 4 of the 8 columns are numeric.
>>> estimate_cluster_mem(ncols=8, nrows=31000000, cat_cols=4, num_cols=4)
explain(models, frame, columns=None, top_n_features=5, include_explanations='ALL', exclude_explanations=[], plot_overrides={}, figsize=(16, 9), render=True, qualitative_colormap='Dark2', sequential_colormap='RdYlBu_r')
Generate model explanations on frame data set.
The H2O Explainability Interface is a convenient wrapper to a number of explainabilty
methods and visualizations in H2O. The function can be applied to a single model or group
of models and returns an object containing explanations, such as a partial dependence plot
or a variable importance plot. Most of the explanations are visual (plots).
These plots can also be created by individual utility functions/methods as well.
:param models: a list of H2O models, an H2O AutoML instance, or an H2OFrame with a 'model_id' column (e.g. H2OAutoML leaderboard)
:param frame: H2OFrame
:param columns: either a list of columns or column indices to show. If specified
parameter top_n_features will be ignored.
:param top_n_features: the number of columns to pick using variable importance (where applicable).
:param include_explanations: if specified, return only the specified model explanations
(Mutually exclusive with exclude_explanations)
:param exclude_explanations: exclude specified model explanations
:param plot_overrides: overrides for individual model explanations
:param figsize: figure size; passed directly to matplotlib
:param render: if True, render the model explanations; otherwise model explanations are just returned
:returns: H2OExplanation containing the model explanations including headers and descriptions
:examples:
>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the H2OAutoML explanation
>>> aml.explain(test)
>>>
>>> # Create the leader model explanation
>>> aml.leader.explain(test)
explain_row(models, frame, row_index, columns=None, top_n_features=5, include_explanations='ALL', exclude_explanations=[], plot_overrides={}, qualitative_colormap='Dark2', figsize=(16, 9), render=True)
Generate model explanations on frame data set for a given instance.
Explain the behavior of a model or group of models with respect to a single row of data.
The function returns an object containing explanations, such as a partial dependence plot
or a variable importance plot. Most of the explanations are visual (plots).
These plots can also be created by individual utility functions/methods as well.
:param models: H2OAutoML object, supervised H2O model, or list of supervised H2O models
:param frame: H2OFrame
:param row_index: row index of the instance to inspect
:param columns: either a list of columns or column indices to show. If specified
parameter top_n_features will be ignored.
:param top_n_features: the number of columns to pick using variable importance (where applicable).
:param include_explanations: if specified, return only the specified model explanations
(Mutually exclusive with exclude_explanations)
:param exclude_explanations: exclude specified model explanations
:param plot_overrides: overrides for individual model explanations
:param qualitative_colormap: a colormap name
:param figsize: figure size; passed directly to matplotlib
:param render: if True, render the model explanations; otherwise model explanations are just returned
:returns: H2OExplanation containing the model explanations including headers and descriptions
:examples:
>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the H2OAutoML explanation
>>> aml.explain_row(test, row_index=0)
>>>
>>> # Create the leader model explanation
>>> aml.leader.explain_row(test, row_index=0)
export_file(frame, path, force=False, sep=',', compression=None, parts=1, header=True, quote_header=True, parallel=False)
Export a given H2OFrame to a path on the machine this python session is currently connected to.
:param frame: the Frame to save to disk.
:param path: the path to the save point on disk.
:param force: if True, overwrite any preexisting file with the same path.
:param sep: field delimiter for the output file.
:param compression: how to compress the exported dataset (default none; gzip, bzip2 and snappy available)
:param parts: enables export to multiple 'part' files instead of just a single file.
Convenient for large datasets that take too long to store in a single file.
Use parts=-1 to instruct H2O to determine the optimal number of part files or
specify your desired maximum number of part files. Path needs to be a directory
when exporting to multiple files, also that directory must be empty.
Default is ``parts = 1``, which is to export to a single file.
:param header: if True, write out column names in the header line.
:param quote_header: if True, quote column names in the header.
:param parallel: use a parallel export to a single file (doesn't apply when num_parts != 1,
might create temporary files in the destination directory).
:examples:
>>> h2o_df = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/prostate/prostate.csv")
>>> h2o_df['CAPSULE'] = h2o_df['CAPSULE'].asfactor()
>>> rand_vec = h2o_df.runif(1234)
>>> train = h2o_df[rand_vec <= 0.8]
>>> valid = h2o_df[(rand_vec > 0.8) & (rand_vec <= 0.9)]
>>> test = h2o_df[rand_vec > 0.9]
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> binomial_fit = H2OGeneralizedLinearEstimator(family = "binomial")
>>> binomial_fit.train(y = "CAPSULE",
... x = ["AGE", "RACE", "PSA", "GLEASON"],
... training_frame = train, validation_frame = valid)
>>> pred = binomial_fit.predict(test)
>>> h2o.export_file(pred, "/tmp/pred.csv", force = True)
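The ``parts`` parameter documented above can also be exercised directly; a sketch of a multi-part export, assuming ``/tmp/pred_parts`` is an existing empty directory (``parts=-1`` lets H2O choose the number of part files):
>>> h2o.export_file(pred, "/tmp/pred_parts", parts = -1, force = True)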
flow()
Open H2O Flow in your browser.
:examples:
>>> import h2o
>>> h2o.init()
>>> h2o.flow()
frame(frame_id)
Retrieve metadata for an id that points to a Frame.
:param frame_id: the key of a Frame in H2O.
:returns: dict containing the frame meta-information.
:examples:
>>> training_data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
>>> frame_summary = h2o.frame(training_data.frame_id)
>>> frame_summary
frames()
Retrieve all the Frames.
:returns: Meta information on the frames
:examples:
>>> arrestsH2O = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/USArrests.csv")
>>> h2o.frames()
get_frame(frame_id, **kwargs)
Obtain a handle to the frame in H2O with the frame_id key.
:param str frame_id: id of the frame to retrieve.
:returns: an :class:`H2OFrame` object
:examples:
>>> from h2o.frame import H2OFrame
>>> frame1 = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
>>> frame2 = h2o.get_frame(frame1.frame_id)
get_grid(grid_id)
Return the specified grid.
:param grid_id: The grid identification in h2o
:returns: an :class:`H2OGridSearch` instance.
:examples:
>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> x = ["DayofMonth", "Month"]
>>> hyper_parameters = {'learn_rate':[0.1,0.2],
... 'max_depth':[2,3],
... 'ntrees':[5,10]}
>>> search_crit = {'strategy': "RandomDiscrete",
... 'max_models': 5,
... 'seed' : 1234,
... 'stopping_metric' : "AUTO",
... 'stopping_tolerance': 1e-2}
>>> air_grid = H2OGridSearch(H2OGradientBoostingEstimator,
... hyper_params=hyper_parameters,
... search_criteria=search_crit)
>>> air_grid.train(x=x,
... y="IsDepDelayed",
... training_frame=airlines,
... distribution="bernoulli")
>>> fetched_grid = h2o.get_grid(str(air_grid.grid_id))
>>> fetched_grid
get_model(model_id)
Load a model from the server.
:param model_id: The model identification in H2O
:returns: Model object, a subclass of H2OEstimator
:examples:
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"]= airlines["Year"].asfactor()
>>> airlines["Month"]= airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
... "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> model = H2OGeneralizedLinearEstimator(family="binomial",
... alpha=0,
... Lambda=1e-5)
>>> model.train(x=predictors,
... y=response,
... training_frame=airlines)
>>> model2 = h2o.get_model(model.model_id)
get_timezone()
Deprecated, use ``h2o.cluster().timezone``.
import_file(path=None, destination_frame=None, parse=True, header=0, sep=None, col_names=None, col_types=None, na_strings=None, pattern=None, skipped_columns=None, custom_non_data_line_markers=None, partition_by=None, quotechar=None, escapechar=None)
Import a dataset that is already on the cluster.
The path to the data must be a valid path for each node in the H2O cluster. If some node in the H2O cluster
cannot see the file, then an exception will be thrown by the H2O cluster. Does a parallel/distributed
multi-threaded pull of the data. The main difference between this method and :func:`upload_file` is that
the latter works with local files, whereas this method imports remote files (i.e. files local to the server).
If you are running the H2O server on your own machine, then both methods behave the same.
:param path: path(s) specifying the location of the data to import or a path to a directory of files to import
:param destination_frame: The unique hex key assigned to the imported file. If none is given, a key will be
automatically generated.
:param parse: If True, the file should be parsed after import. If False, then a list is returned containing the file path.
:param header: -1 means the first line is data, 0 means guess, 1 means first line is header.
:param sep: The field separator character. Values on each line of the file are separated by
this character. If not provided, the parser will automatically detect the separator.
:param col_names: A list of column names for the file.
:param col_types: A list of types or a dictionary of column names to types to specify whether columns
should be forced to a certain type upon import parsing. If a list, the types for elements that are
None will be guessed. The possible types a column may have are:
- "unknown" - this will force the column to be parsed as all NA
- "uuid" - the values in the column must be true UUID or will be parsed as NA
- "string" - force the column to be parsed as a string
- "numeric" - force the column to be parsed as numeric. H2O will handle the compression of the numeric
data in the optimal manner.
- "enum" - force the column to be parsed as a categorical column.
- "time" - force the column to be parsed as a time column. H2O will attempt to parse the following
list of date time formats: (date) "yyyy-MM-dd", "yyyy MM dd", "dd-MMM-yy", "dd MMM yy", (time)
"HH:mm:ss", "HH:mm:ss:SSS", "HH:mm:ss:SSSnnnnnn", "HH.mm.ss", "HH.mm.ss.SSS", "HH.mm.ss.SSSnnnnnn".
Times can also contain "AM" or "PM".
:param partition_by: Names of the columns the persisted dataset has been partitioned by.
:param na_strings: A list of strings, or a list of lists of strings (one list per column), or a dictionary
of column names to strings which are to be interpreted as missing values.
:param pattern: Character string containing a regular expression to match file(s) in the folder if `path` is a
directory.
:param skipped_columns: an integer list of column indices to skip and not parse into the final frame from the import file.
:param custom_non_data_line_markers: If a line in the imported file starts with any character in the given string it will NOT be imported. An empty string means all lines are imported; None means the default behaviour for the given format will be used.
:param quotechar: A hint for the parser which character to expect as quoting character. Only single quote, double quote or None (default) are allowed. None means automatic detection.
:param escapechar: (Optional) One ASCII character used to escape other characters.
:returns: a new :class:`H2OFrame` instance.
:examples:
>>> birds = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/birds.csv")
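A sketch of an import that overrides parsing, using the parameters documented above (the column name "howmany" and the NA strings are illustrative assumptions, not verified against the birds dataset):
>>> birds2 = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/birds.csv",
...                          header=1,
...                          sep=",",
...                          na_strings=["NA", ""],
...                          col_types={"howmany": "numeric"})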
import_hive_table(database=None, table=None, partitions=None, allow_multi_format=False)
Import Hive table to H2OFrame in memory.
Make sure to start H2O with Hive on the classpath. Uses hive-site.xml on the classpath to connect to Hive.
When the database is specified as a JDBC URL, the Hive JDBC driver is used to obtain table metadata, then
direct HDFS access is used to import the data.
:param database: Name of Hive database (default database will be used by default), can be also a JDBC URL.
:param table: name of Hive table to import
:param partitions: a list of lists of strings - partition key column values of partitions you want to import.
:param allow_multi_format: enable import of partitioned tables with different storage formats used. WARNING:
this may fail on out-of-memory for tables with a large number of small partitions.
:returns: an :class:`H2OFrame` containing data of the specified Hive table.
:examples:
>>> basic_import = h2o.import_hive_table("default",
... "table_name")
>>> jdbc_import = h2o.import_hive_table("jdbc:hive2://hive-server:10000/default",
... "table_name")
>>> multi_format_enabled = h2o.import_hive_table("default",
... "table_name",
... allow_multi_format=True)
>>> with_partition_filter = h2o.import_hive_table("jdbc:hive2://hive-server:10000/default",
... "table_name",
... [["2017", "02"]])
import_mojo(mojo_path)
Imports an existing MOJO model as an H2O model.
:param mojo_path: Path to the MOJO archive on the H2O's filesystem
:return: An H2OGenericEstimator instance embedding given MOJO
:examples:
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> model = H2OGradientBoostingEstimator(ntrees = 1)
>>> model.train(x = ["Origin", "Dest"],
... y = "IsDepDelayed",
... training_frame=airlines)
>>> original_model_filename = tempfile.mkdtemp()
>>> original_model_filename = model.download_mojo(original_model_filename)
>>> mojo_model = h2o.import_mojo(original_model_filename)
import_sql_select(connection_url, select_query, username, password, optimize=True, use_temp_table=None, temp_table_name=None, fetch_mode=None, num_chunks_hint=None)
Import the SQL table that is the result of the specified SQL query to H2OFrame in memory.
Creates a temporary SQL table from the specified sql_query.
Runs multiple SELECT SQL queries on the temporary table concurrently for parallel ingestion, then drops the table.
Be sure to start the h2o.jar in the terminal with your downloaded JDBC driver in the classpath::
java -cp <path_to_h2o_jar>:<path_to_jdbc_driver_jar> water.H2OApp
Also see h2o.import_sql_table. Currently supported SQL databases are MySQL, PostgreSQL, MariaDB, Hive, Oracle
and Microsoft SQL Server.
:param connection_url: URL of the SQL database connection as specified by the Java Database Connectivity (JDBC)
Driver. For example, "jdbc:mysql://localhost:3306/menagerie?&useSSL=false"
:param select_query: SQL query starting with `SELECT` that returns rows from one or more database tables.
:param username: username for SQL server
:param password: password for SQL server
:param optimize: DEPRECATED. Ignored - use fetch_mode instead. Optimize import of SQL table for faster imports.
:param use_temp_table: whether a temporary table should be created from select_query
:param temp_table_name: name of temporary table to be created from select_query
:param fetch_mode: Set to DISTRIBUTED to enable distributed import. Set to SINGLE to force a sequential read by a single node
from the database.
:param num_chunks_hint: Desired number of chunks for the target Frame.
:returns: an :class:`H2OFrame` containing data of the specified SQL query.
:examples:
>>> conn_url = "jdbc:mysql://172.16.2.178:3306/ingestSQL?&useSSL=false"
>>> select_query = "SELECT bikeid from citibike20k"
>>> username = "root"
>>> password = "abc123"
>>> my_citibike_data = h2o.import_sql_select(conn_url, select_query,
... username, password)
import_sql_table(connection_url, table, username, password, columns=None, optimize=True, fetch_mode=None, num_chunks_hint=None)
Import SQL table to H2OFrame in memory.
Assumes that the SQL table is not being updated and is stable.
Runs multiple SELECT SQL queries concurrently for parallel ingestion.
Be sure to start the h2o.jar in the terminal with your downloaded JDBC driver in the classpath::
java -cp <path_to_h2o_jar>:<path_to_jdbc_driver_jar> water.H2OApp
Also see :func:`import_sql_select`.
Currently supported SQL databases are MySQL, PostgreSQL, MariaDB, Hive, Oracle and Microsoft SQL.
:param connection_url: URL of the SQL database connection as specified by the Java Database Connectivity (JDBC)
Driver. For example, "jdbc:mysql://localhost:3306/menagerie?&useSSL=false"
:param table: name of SQL table
:param columns: a list of column names to import from SQL table. Default is to import all columns.
:param username: username for SQL server
:param password: password for SQL server
:param optimize: DEPRECATED. Ignored - use fetch_mode instead. Optimize import of SQL table for faster imports.
:param fetch_mode: Set to DISTRIBUTED to enable distributed import. Set to SINGLE to force a sequential read by a single node
from the database.
:param num_chunks_hint: Desired number of chunks for the target Frame.
:returns: an :class:`H2OFrame` containing data of the specified SQL table.
:examples:
>>> conn_url = "jdbc:mysql://172.16.2.178:3306/ingestSQL?&useSSL=false"
>>> table = "citibike20k"
>>> username = "root"
>>> password = "abc123"
>>> my_citibike_data = h2o.import_sql_table(conn_url, table, username, password)
init(url=None, ip=None, port=None, name=None, https=None, cacert=None, insecure=None, username=None, password=None, cookies=None, proxy=None, start_h2o=True, nthreads=-1, ice_root=None, log_dir=None, log_level=None, max_log_file_size=None, enable_assertions=True, max_mem_size=None, min_mem_size=None, strict_version_check=None, ignore_config=False, extra_classpath=None, jvm_custom_args=None, bind_to_localhost=True, **kwargs)
Attempt to connect to a local server, or if not successful start a new server and connect to it.
:param url: Full URL of the server to connect to (can be used instead of `ip` + `port` + `https`).
:param ip: The ip address (or host name) of the server where H2O is running.
:param port: Port number that H2O service is listening to.
:param name: Cluster name. If None while connecting to an existing cluster, the cluster name is not checked.
If set, the client will connect only if the target cluster name matches. If no existing instance is found and a
local one is started, this will be used as the cluster name (a random one is generated if set to None).
:param https: Set to True to connect via https:// instead of http://.
:param cacert: Path to a CA bundle file or a directory with certificates of trusted CAs (optional).
:param insecure: When using https, setting this to True will disable SSL certificates verification.
:param username: Username for basic authentication.
:param password: Password for basic authentication.
:param cookies: Cookie (or list of) to add to each request.
:param proxy: Proxy server address.
:param start_h2o: If False, do not attempt to start an h2o server when connection to an existing one failed.
:param nthreads: "Number of threads" option when launching a new h2o server.
:param ice_root: Directory for temporary files for the new h2o server.
:param log_dir: Directory for H2O logs to be stored if a new instance is started. Ignored if connecting to an existing node.
:param log_level: The logger level for H2O if a new instance is started. One of TRACE,DEBUG,INFO,WARN,ERRR,FATA. Default is INFO. Ignored if connecting to an existing node.
:param max_log_file_size: Maximum size of INFO and DEBUG log files. The file is rolled over after a specified size has been reached. (The default is 3MB. Minimum is 1MB and maximum is 99999MB)
:param enable_assertions: Enable assertions in Java for the new h2o server.
:param max_mem_size: Maximum memory to use for the new h2o server. Integer input will be evaluated as gigabytes. Other units can be specified by passing in a string (e.g. "160M" for 160 megabytes).
- **Note:** If `max_mem_size` is not defined, then the amount of memory that H2O allocates will be determined by the default memory of the Java Virtual Machine (JVM). This amount depends on the Java version, but it will generally be 25% of the machine's physical memory.
:param min_mem_size: Minimum memory to use for the new h2o server. Integer input will be evaluated as gigabytes. Other units can be specified by passing in a string (e.g. "160M" for 160 megabytes).
:param strict_version_check: If True, an error will be raised if the client and server versions don't match.
:param ignore_config: If True, skip processing of the .h2oconfig file. Default value is False.
:param extra_classpath: List of paths to libraries that should be included on the Java classpath when starting H2O from Python.
:param kwargs: (all other deprecated attributes)
:param jvm_custom_args: Custom, user-defined arguments for the JVM in which H2O is instantiated. Ignored if there is an instance of H2O already running and the client connects to it.
:param bind_to_localhost: A flag indicating whether access to the H2O instance should be restricted to the local machine (default) or if it can be reached from other computers on the network.
:examples:
>>> import h2o
>>> h2o.init(ip="localhost", port=54323)
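A second sketch (the memory values and thread count below are illustrative, not from this help text) showing how to size the JVM when a new local server is started, using the string units described for ``max_mem_size`` above:
>>> h2o.init(max_mem_size="4G", min_mem_size="1G", nthreads=2)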
interaction(data, factors, pairwise, max_factors, min_occurrence, destination_frame=None)
Categorical Interaction Feature Creation in H2O.
Creates a frame in H2O with n-th order interaction features between categorical columns, as specified by
the user.
:param data: the H2OFrame that holds the target categorical columns.
:param factors: factor columns (either indices or column names).
:param pairwise: If True, create pairwise interactions between factors (otherwise create one
higher-order interaction). Only applicable if there are 3 or more factors.
:param max_factors: Max. number of factor levels in pair-wise interaction terms (if enforced, one extra
catch-all factor will be made).
:param min_occurrence: Min. occurrence threshold for factor levels in pair-wise interaction terms
:param destination_frame: a string indicating the destination key. If empty, this will be auto-generated by H2O.
:returns: :class:`H2OFrame`
:examples:
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> iris = iris.cbind(iris[4] == "Iris-setosa")
>>> iris[5] = iris[5].asfactor()
>>> iris.set_name(5,"C6")
>>> iris = iris.cbind(iris[4] == "Iris-virginica")
>>> iris[6] = iris[6].asfactor()
>>> iris.set_name(6, name="C7")
>>> two_way_interactions = h2o.interaction(iris,
... factors=[4,5,6],
... pairwise=True,
... max_factors=10000,
... min_occurrence=1)
>>> from h2o.utils.typechecks import assert_is_type
>>> assert_is_type(two_way_interactions, H2OFrame)
>>> levels1 = two_way_interactions.levels()[0]
>>> levels2 = two_way_interactions.levels()[1]
>>> levels3 = two_way_interactions.levels()[2]
>>> two_way_interactions
is_expr_optimizations_enabled()
:examples:
>>> h2o.enable_expr_optimizations(True)
>>> h2o.is_expr_optimizations_enabled()
>>> h2o.enable_expr_optimizations(False)
>>> h2o.is_expr_optimizations_enabled()
lazy_import(path, pattern=None)
Import a single file or collection of files.
:param path: A path to a data file (remote or local).
:param pattern: Character string containing a regular expression to match file(s) in the folder.
:returns: either a :class:`H2OFrame` with the content of the provided file, or a list of such frames if
importing multiple files.
:examples:
>>> iris = h2o.lazy_import("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
list_timezones()
Deprecated, use ``h2o.cluster().list_timezones()``.
load_dataset(relative_path)
Imports a data file within the 'h2o_data' folder.
:examples:
>>> fr = h2o.load_dataset("iris")
load_grid(grid_file_path, load_params_references=False)
Loads a previously saved grid with all its models from the same folder.
:param grid_file_path: A string containing the path to the file with grid saved.
Grid models are expected to be in the same folder.
:param load_params_references: If True, will attempt to reload saved objects referenced by grid parameters
(e.g. training frame, calibration frame); will fail if the grid was saved without referenced objects.
:return: An instance of H2OGridSearch
:examples:
>>> from collections import OrderedDict
>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> train = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
# Run GBM Grid Search
>>> ntrees_opts = [1, 3]
>>> learn_rate_opts = [0.1, 0.01, .05]
>>> hyper_parameters = OrderedDict()
>>> hyper_parameters["learn_rate"] = learn_rate_opts
>>> hyper_parameters["ntrees"] = ntrees_opts
>>> export_dir = pyunit_utils.locate("results")
>>> gs = H2OGridSearch(H2OGradientBoostingEstimator, hyper_params=hyper_parameters)
>>> gs.train(x=list(range(4)), y=4, training_frame=train)
>>> grid_id = gs.grid_id
>>> old_grid_model_count = len(gs.model_ids)
# Save the grid search to the export directory
>>> saved_path = h2o.save_grid(export_dir, grid_id)
>>> h2o.remove_all();
>>> train = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
# Load the grid search
>>> grid = h2o.load_grid(saved_path)
>>> grid.train(x=list(range(4)), y=4, training_frame=train)
load_model(path)
Load a saved H2O model from disk. (Note that ensemble binary models can now be loaded using this method.)
:param path: the full path of the H2O Model to be imported.
:returns: an :class:`H2OEstimator` object
:examples:
>>> training_data = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
... "DayOfWeek", "Month", "Distance", "FlightNum"]
>>> response = "IsDepDelayed"
>>> model = H2OGeneralizedLinearEstimator(family="binomial",
... alpha=0,
... Lambda=1e-5)
>>> model.train(x=predictors,
... y=response,
... training_frame=training_data)
>>> model_path = h2o.save_model(model, path='', force=True)
>>> h2o.load_model(model_path)
log_and_echo(message='')
Log a message on the server-side logs.
This is helpful when running several pieces of work one after the other on a single H2O
cluster and you want to make a notation in the H2O server side log where one piece of
work ends and the next piece of work begins.
Sends a message to H2O for logging. Generally used for debugging purposes.
:param message: message to write to the log.
:examples:
>>> ret = h2o.log_and_echo("Testing h2o.log_and_echo")
ls()
List keys on an H2O Cluster.
:examples:
>>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
>>> h2o.ls()
make_metrics(predicted, actual, domain=None, distribution=None, weights=None, auc_type='NONE')
Create Model Metrics from predicted and actual values in H2O.
:param H2OFrame predicted: an H2OFrame containing predictions.
:param H2OFrame actual: an H2OFrame containing actual values.
:param domain: list of response factors for classification.
:param distribution: distribution for regression.
:param H2OFrame weights: an H2OFrame containing observation weights (optional).
:param auc_type: For multinomial classification, specifies which type of aggregated AUC/AUCPR
will be used to calculate this metric. Possibilities are MACRO_OVO, MACRO_OVR, WEIGHTED_OVO, WEIGHTED_OVR,
NONE and AUTO (OVO = One vs. One, OVR = One vs. Rest). Default is "NONE" (AUC and AUCPR are not calculated).
:examples:
>>> fr = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> fr["CAPSULE"] = fr["CAPSULE"].asfactor()
>>> fr["RACE"] = fr["RACE"].asfactor()
>>> response = "AGE"
>>> predictors = list(set(fr.names) - {"ID", response})
>>> for distr in ["gaussian", "poisson", "laplace", "gamma"]:
... print("distribution: %s" % distr)
... model = H2OGradientBoostingEstimator(distribution=distr,
... ntrees=2,
... max_depth=3,
... min_rows=1,
... learn_rate=0.1,
... nbins=20)
... model.train(x=predictors,
... y=response,
... training_frame=fr)
... predicted = h2o.assign(model.predict(fr), "pred")
... actual = fr[response]
... m0 = model.model_performance(train=True)
... m1 = h2o.make_metrics(predicted, actual, distribution=distr)
... m2 = h2o.make_metrics(predicted, actual)
>>> print(m0)
>>> print(m1)
>>> print(m2)
model_correlation_heatmap(models, frame, top_n=None, cluster_models=True, triangular=True, figsize=(13, 13), colormap='RdYlBu_r')
Model Prediction Correlation Heatmap
This plot shows the correlation between the predictions of the models.
For classification, frequency of identical predictions is used. By default, models
are ordered by their similarity (as computed by hierarchical clustering).
:param models: a list of H2O models, an H2O AutoML instance, or an H2OFrame with a 'model_id' column (e.g. H2OAutoML leaderboard)
:param frame: H2OFrame
:param top_n: DEPRECATED. show just top n models (applies only when used with H2OAutoML).
:param cluster_models: if True, cluster the models
:param triangular: make the heatmap triangular
:param figsize: figsize: figure size; passed directly to matplotlib
:param colormap: colormap to use
:returns: a matplotlib figure object
:examples:
>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the model correlation heatmap
>>> aml.model_correlation_heatmap(test)
models()
Retrieve the IDs of all the Models.
:returns: Handles of all the models present in the cluster
:examples:
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> airlines["Year"]= airlines["Year"].asfactor()
>>> airlines["Month"]= airlines["Month"].asfactor()
>>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
>>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
>>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
>>> response = "IsDepDelayed"
>>> model1 = H2OGeneralizedLinearEstimator(family="binomial")
>>> model1.train(y=response, training_frame=airlines)
>>> model2 = H2OXGBoostEstimator()
>>> model2.train(y=response, training_frame=airlines)
>>> model_list = h2o.models()
mojo_predict_csv(input_csv_path, mojo_zip_path, output_csv_path=None, genmodel_jar_path=None, classpath=None, java_options=None, verbose=False, setInvNumNA=False, predict_contributions=False)
MOJO scoring function to take a CSV file and use MOJO model as zip file to score.
:param input_csv_path: Path to input CSV file.
:param mojo_zip_path: Path to MOJO zip downloaded from H2O.
:param output_csv_path: Optional, name of the output CSV file with computed predictions. If None (default), then
predictions will be saved as prediction.csv in the same folder as the MOJO zip.
:param genmodel_jar_path: Optional, path to genmodel jar file. If None (default) then the h2o-genmodel.jar in the same
folder as the MOJO zip will be used.
:param classpath: Optional, specifies custom user defined classpath which will be used when scoring. If None
(default) then the default classpath for this MOJO model will be used.
:param java_options: Optional, custom user defined options for Java. By default ``-Xmx4g -XX:ReservedCodeCacheSize=256m`` is used.
:param verbose: Optional, if True, then additional debug information will be printed. False by default.
:param predict_contributions: if True, then return prediction contributions instead of regular predictions
(only for tree-based models).
:return: List of computed predictions
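:examples:
A hedged usage sketch (the paths and the prior ``model`` object are illustrative assumptions; requires a trained model whose MOJO and h2o-genmodel.jar were downloaded side by side):
>>> mojo_zip = model.download_mojo(path="/tmp/mojo", get_genmodel_jar=True)
>>> predictions = h2o.mojo_predict_csv(input_csv_path="/tmp/input.csv",
...                                    mojo_zip_path=mojo_zip)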
mojo_predict_pandas(dataframe, mojo_zip_path, genmodel_jar_path=None, classpath=None, java_options=None, verbose=False, setInvNumNA=False, predict_contributions=False)
MOJO scoring function to take a Pandas frame and use MOJO model as zip file to score.
:param dataframe: Pandas frame to score.
:param mojo_zip_path: Path to MOJO zip downloaded from H2O.
:param genmodel_jar_path: Optional, path to genmodel jar file. If None (default) then the h2o-genmodel.jar in the same
folder as the MOJO zip will be used.
:param classpath: Optional, specifies custom user defined classpath which will be used when scoring. If None
(default) then the default classpath for this MOJO model will be used.
:param java_options: Optional, custom user defined options for Java. By default ``-Xmx4g`` is used.
:param verbose: Optional, if True, then additional debug information will be printed. False by default.
:param predict_contributions: if True, then return prediction contributions instead of regular predictions
(only for tree-based models).
:return: Pandas frame with predictions
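:examples:
A hedged sketch (the CSV path and MOJO zip path are illustrative; assumes pandas is installed and an h2o-genmodel.jar sits next to the MOJO zip):
>>> import pandas as pd
>>> df = pd.read_csv("/tmp/input.csv")
>>> predictions = h2o.mojo_predict_pandas(df, mojo_zip_path="/tmp/mojo/model.zip")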
network_test()
Deprecated, use ``h2o.cluster().network_test()``.
no_progress()
Disable the progress bar from flushing to stdout.
The completed progress bar is printed when a job is complete so as to demarcate a log file.
:examples:
>>> h2o.no_progress()
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> x = ["DayofMonth", "Month"]
>>> model = H2OGeneralizedLinearEstimator(family="binomial",
... alpha=0,
... Lambda=1e-5)
>>> model.train(x=x, y="IsDepDelayed", training_frame=airlines)
parse_raw(setup, id=None, first_line_is_header=0)
Parse dataset using the parse setup structure.
:param setup: Result of ``h2o.parse_setup()``
:param id: an id for the frame.
:param first_line_is_header: -1 means the first line is data, 0 means guess, 1 means the first line is the header.
:returns: an :class:`H2OFrame` object.
:examples:
>>> fraw = h2o.import_file(("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip"),
... parse=False)
>>> fhex = h2o.parse_raw(h2o.parse_setup(fraw),
... id='prostate.csv',
... first_line_is_header=0)
>>> fhex.summary()
parse_setup(raw_frames, destination_frame=None, header=0, separator=None, column_names=None, column_types=None, na_strings=None, skipped_columns=None, custom_non_data_line_markers=None, partition_by=None, quotechar=None, escapechar=None)
Retrieve H2O's best guess as to what the structure of the data file is.
During parse setup, the H2O cluster will make several guesses about the attributes of
the data. This method allows a user to perform corrective measures by updating the
returning dictionary from this method. This dictionary is then fed into `parse_raw` to
produce the H2OFrame instance.
:param raw_frames: a collection of imported file frames
:param destination_frame: The unique hex key assigned to the imported file. If none is given, a key will
automatically be generated.
:param header: -1 means the first line is data, 0 means guess, 1 means first line is header.
:param separator: The field separator character. Values on each line of the file are separated by
this character. If not provided, the parser will automatically detect the separator.
:param column_names: A list of column names for the file. If skipped_columns are specified, only list column names
of columns that are not skipped.
:param column_types: A list of types or a dictionary of column names to types to specify whether columns
should be forced to a certain type upon import parsing. If a list, the types for elements that are
None will be guessed. If skipped_columns are specified, only list column types of columns that are not skipped.
The possible types a column may have are:
- "unknown" - this will force the column to be parsed as all NA
- "uuid" - the values in the column must be true UUID or will be parsed as NA
- "string" - force the column to be parsed as a string
- "numeric" - force the column to be parsed as numeric. H2O will handle the compression of the numeric
data in the optimal manner.
- "enum" - force the column to be parsed as a categorical column.
- "time" - force the column to be parsed as a time column. H2O will attempt to parse the following
list of date time formats: (date) "yyyy-MM-dd", "yyyy MM dd", "dd-MMM-yy", "dd MMM yy", (time)
"HH:mm:ss", "HH:mm:ss:SSS", "HH:mm:ss:SSSnnnnnn", "HH.mm.ss" "HH.mm.ss.SSS", "HH.mm.ss.SSSnnnnnn".
Times can also contain "AM" or "PM".
:param na_strings: A list of strings, or a list of lists of strings (one list per column), or a dictionary
of column names to strings which are to be interpreted as missing values.
:param skipped_columns: a list of integer column indices to skip; these columns are not parsed into the final frame from the import file.
:param custom_non_data_line_markers: If a line in the imported file starts with any character in the given string, it will NOT be imported. An empty string means all lines are imported; None means the default behaviour for the given format is used.
:param partition_by: A list of columns the dataset has been partitioned by. None by default.
:param quotechar: A hint for the parser which character to expect as quoting character. Only single quote, double quote or None (default) are allowed. None means automatic detection.
:param escapechar: (Optional) One ASCII character used to escape other characters.
:returns: a dictionary containing parse parameters guessed by the H2O backend.
:examples:
>>> col_headers = ["ID","CAPSULE","AGE","RACE",
... "DPROS","DCAPS","PSA","VOL","GLEASON"]
>>> col_types=['enum','enum','numeric','enum',
... 'enum','enum','numeric','numeric','numeric']
>>> hex_key = "training_data.hex"
>>> fraw = h2o.import_file(("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip"),
... parse=False)
>>> setup = h2o.parse_setup(fraw,
... destination_frame=hex_key,
... header=1,
... separator=',',
... column_names=col_headers,
... column_types=col_types,
... na_strings=["NA"])
>>> setup
pd_multi_plot(models, frame, column, best_of_family=True, row_index=None, target=None, max_levels=30, figsize=(16, 9), colormap='Dark2', markers=['o', 'v', 's', 'P', '*', 'D', 'X', '^', '<', '>', '.'])
Plot partial dependencies of a variable across multiple models.
Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable
on the response. The effect of a variable is measured in change in the mean response.
PDP assumes independence between the feature for which the PDP is computed and the rest.
:param models: a list of H2O models, an H2O AutoML instance, or an H2OFrame with a 'model_id' column (e.g. H2OAutoML leaderboard)
:param frame: H2OFrame
:param column: string containing column name
:param best_of_family: if True, show only the best models per family
:param row_index: if None, do partial dependence, if integer, do individual
conditional expectation for the row specified by this integer
:param target: (only for multinomial classification) for what target should the plot be done
:param max_levels: maximum number of factor levels to show
:param figsize: figure size; passed directly to matplotlib
:param colormap: colormap name
:param markers: List of markers to use for factors, when it runs out of possible markers the last in
this list will get reused
:returns: a matplotlib figure object
:examples:
>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create a partial dependence plot
>>> aml.pd_multi_plot(test, column="alcohol")
print_mojo(mojo_path, format='json', tree_index=None)
Generates string representation of an existing MOJO model.
:param mojo_path: Path to the MOJO archive on the user's local filesystem
:param format: Output format. Possible values: json (default), dot, png
:param tree_index: Index of tree to print
:return: A string representation of the MOJO for text output formats,
a path to a directory with the rendered images for image output formats
(or a path to a file if only a single tree is output)
:example:
>>> import json
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> prostate = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
>>> prostate["CAPSULE"] = prostate["CAPSULE"].asfactor()
>>> gbm_h2o = H2OGradientBoostingEstimator(ntrees = 5,
... learn_rate = 0.1,
... max_depth = 4,
... min_rows = 10)
>>> gbm_h2o.train(x = list(range(1,prostate.ncol)),
... y = "CAPSULE",
... training_frame = prostate)
>>> mojo_path = gbm_h2o.download_mojo()
>>> mojo_str = h2o.print_mojo(mojo_path)
>>> mojo_dict = json.loads(mojo_str)
rapids(expr)
Execute a Rapids expression.
:param expr: The rapids expression (ascii string).
:returns: The JSON response (as a python dictionary) of the Rapids execution.
:examples:
>>> rapidTime = h2o.rapids("(getTimeZone)")["string"]
>>> print(str(rapidTime))
remove(x, cascade=True)
Remove object(s) from H2O.
:param x: H2OFrame, H2OEstimator, or string, or a list of those things: the object(s) or unique id(s)
pointing to the object(s) to be removed.
:param cascade: boolean, if set to TRUE (default), the object dependencies (e.g. submodels) are also removed.
:examples:
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> h2o.remove(airlines)
>>> airlines
# Should receive error: "This H2OFrame has been removed."
remove_all(retained=None)
Removes all objects from H2O, with the possibility to specify models and frames to retain.
Retained keys must be keys of models and frames only. For models retained, training and validation frames are retained as well.
Cross-validation models of a retained model are NOT retained automatically; those must be specified explicitly.
:param retained: Keys of models and frames to retain
:examples:
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> gbm = H2OGradientBoostingEstimator(ntrees = 1)
>>> gbm.train(x = ["Origin", "Dest"],
... y = "IsDepDelayed",
... training_frame=airlines)
>>> h2o.remove_all([airlines.frame_id,
... gbm.model_id])
resume(recovery_dir=None)
Triggers auto-recovery resume - this will look into the configured recovery dir and resume any
tasks that were interrupted by unexpected cluster stopping.
:param recovery_dir: A path to where cluster recovery data is stored, if blank, will use cluster's configuration.
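:examples:
A hedged sketch (the HDFS path is illustrative; requires a cluster started with auto-recovery configured):
>>> h2o.resume(recovery_dir="hdfs://node-1:8020/h2o-recovery")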
save_grid(grid_directory, grid_id, save_params_references=False, export_cross_validation_predictions=False)
Export a Grid and all its models into the given folder.
:param grid_directory: A string containing the path to the folder for the grid to be saved to.
:param grid_id: A character string with identification of the Grid in H2O.
:param save_params_references: True if objects referenced by grid parameters
(e.g. training frame, calibration frame) should also be saved.
:param export_cross_validation_predictions: A boolean flag indicating whether the models exported from the grid
should be saved with CV Holdout Frame predictions. Default is not to export the predictions.
:examples:
>>> from collections import OrderedDict
>>> from h2o.grid.grid_search import H2OGridSearch
>>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
>>> train = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
# Run GBM Grid Search
>>> ntrees_opts = [1, 3]
>>> learn_rate_opts = [0.1, 0.01, .05]
>>> hyper_parameters = OrderedDict()
>>> hyper_parameters["learn_rate"] = learn_rate_opts
>>> hyper_parameters["ntrees"] = ntrees_opts
>>> export_dir = pyunit_utils.locate("results")
>>> gs = H2OGridSearch(H2OGradientBoostingEstimator, hyper_params=hyper_parameters)
>>> gs.train(x=list(range(4)), y=4, training_frame=train)
>>> grid_id = gs.grid_id
>>> old_grid_model_count = len(gs.model_ids)
# Save the grid search to the export directory
>>> saved_path = h2o.save_grid(export_dir, grid_id)
>>> h2o.remove_all();
>>> train = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris_wheader.csv")
# Load the grid search
>>> grid = h2o.load_grid(saved_path)
>>> grid.train(x=list(range(4)), y=4, training_frame=train)
save_model(model, path='', force=False, export_cross_validation_predictions=False, filename=None)
Save an H2O Model object to disk. (Note that ensemble binary models can now be saved using this method.)
The owner of the saved file is the user under which the H2O cluster was executed.
:param model: The model object to save.
:param path: a path to save the model at (hdfs, s3, local)
:param force: if True, overwrite the destination directory if it exists; if False, throw an exception instead.
:param export_cross_validation_predictions: logical, indicates whether the exported model
artifact should also include CV Holdout Frame predictions. Default is not to export the predictions.
:param filename: a filename for the saved model
:returns: the path of the saved model
:examples:
>>> from h2o.estimators.glm import H2OGeneralizedLinearEstimator
>>> h2o_df = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv.zip")
>>> my_model = H2OGeneralizedLinearEstimator(family = "binomial")
>>> my_model.train(y = "CAPSULE",
... x = ["AGE", "RACE", "PSA", "GLEASON"],
... training_frame = h2o_df)
>>> h2o.save_model(my_model, path='', force=True)
set_timezone(value)
Deprecated, set ``h2o.cluster().timezone`` instead.
show_progress()
Enable the progress bar (it is enabled by default).
:examples:
>>> h2o.no_progress()
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> x = ["DayofMonth", "Month"]
>>> model = H2OGeneralizedLinearEstimator(family="binomial",
... alpha=0,
... Lambda=1e-5)
>>> model.train(x=x, y="IsDepDelayed", training_frame=airlines)
>>> h2o.show_progress()
>>> model.train(x=x, y="IsDepDelayed", training_frame=airlines)
shutdown(prompt=False)
Deprecated, use ``h2o.cluster().shutdown()``.
upload_custom_distribution(func, func_file='distributions.py', func_name=None, class_name=None, source_provider=None)
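Upload given distribution function into H2O cluster. No further description is printed for this function; the example below is a hedged sketch by analogy with ``upload_custom_metric``: the class follows the custom-distribution pattern (``link``/``init``/``gradient``/``gamma``) from H2O's custom-distribution documentation, mimicking a Gaussian distribution, and all names are illustrative.
:examples:
>>> class CustomDistributionGaussian:
...     def link(self):
...         # link function name understood by H2O
...         return "identity"
...     def init(self, w, o, y):
...         # initial estimate: weighted residual and weight
...         return [w * (y - o), w]
...     def gradient(self, y, f):
...         # negative gradient of the deviance
...         return y - f
...     def gamma(self, w, y, z, f):
...         # terminal-node estimate components
...         return [w * z, w]
>>> ref = h2o.upload_custom_distribution(CustomDistributionGaussian,
...                                      func_name="gaussian",
...                                      func_file="custom_gaussian.py")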
upload_custom_metric(func, func_file='metrics.py', func_name=None, class_name=None, source_provider=None)
Upload given metrics function into H2O cluster.
The metrics can have different representations:
- class: needs to implement map(pred, act, weight, offset, model), reduce(l, r) and metric(l) methods
- string: the same as in class case, but the class is given as a string
:param func: metric representation: string or class
:param func_file: internal name of the file in which to save the given metric representation
:param func_name: name of the h2o key under which the given metric is saved
:param class_name: name of the class wrapping the metric function (when supplied as a string)
:param source_provider: a function which provides the source code for the given function
:return: reference to the uploaded metric function
:examples:
>>> class CustomMaeFunc:
...     def map(self, pred, act, w, o, model):
...         return [abs(act[0] - pred[0]), 1]
...
...     def reduce(self, l, r):
...         return [l[0] + r[0], l[1] + r[1]]
...
...     def metric(self, l):
...         return l[0] / l[1]
...
>>> custom_func_str = '''class CustomMaeFunc:
...     def map(self, pred, act, w, o, model):
...         return [abs(act[0] - pred[0]), 1]
...
...     def reduce(self, l, r):
...         return [l[0] + r[0], l[1] + r[1]]
...
...     def metric(self, l):
...         return l[0] / l[1]'''
>>>
>>> h2o.upload_custom_metric(custom_func_str, class_name="CustomMaeFunc", func_name="mae")
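The map/reduce/metric contract above can be checked in plain Python, without a cluster: H2O calls map() once per row, merges the partial results pairwise with reduce(), and finishes with metric(). The driver function and sample data below are illustrative, not part of the h2o API.

```python
# Pure-Python sketch of the custom-metric contract. H2O runs map() per row,
# reduce() to merge partials, metric() for the final value.
class CustomMaeFunc:
    def map(self, pred, act, w, o, model):
        # one row -> [absolute error, row count]
        return [abs(act[0] - pred[0]), 1]

    def reduce(self, l, r):
        # merge two partial results
        return [l[0] + r[0], l[1] + r[1]]

    def metric(self, l):
        # final value: mean absolute error
        return l[0] / l[1]

def run_metric(func, preds, acts):
    # hypothetical driver mimicking H2O's map/reduce schedule
    parts = [func.map([p], [a], 1, 0, None) for p, a in zip(preds, acts)]
    total = parts[0]
    for part in parts[1:]:
        total = func.reduce(total, part)
    return func.metric(total)

mae = run_metric(CustomMaeFunc(), [1.0, 2.0, 4.0], [1.5, 2.0, 3.0])
print(mae)  # (0.5 + 0.0 + 1.0) / 3 = 0.5
```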
upload_file(path, destination_frame=None, header=0, sep=None, col_names=None, col_types=None, na_strings=None, skipped_columns=None, quotechar=None, escapechar=None)
Upload a dataset from the provided local path to the H2O cluster.
Does a single-threaded push to H2O. Also see :meth:`import_file`.
:param path: A path specifying the location of the data to upload.
:param destination_frame: The unique hex key assigned to the imported file. If none is given, a key will
be automatically generated.
:param header: -1 means the first line is data, 0 means guess, 1 means first line is header.
:param sep: The field separator character. Values on each line of the file are separated by
this character. If not provided, the parser will automatically detect the separator.
:param col_names: A list of column names for the file.
:param col_types: A list of types or a dictionary of column names to types to specify whether columns
should be forced to a certain type upon import parsing. If a list, the types for elements that are
None will be guessed. The possible types a column may have are:
- "unknown" - this will force the column to be parsed as all NA
- "uuid" - the values in the column must be true UUID or will be parsed as NA
- "string" - force the column to be parsed as a string
- "numeric" - force the column to be parsed as numeric. H2O will handle the compression of the numeric
data in the optimal manner.
- "enum" - force the column to be parsed as a categorical column.
- "time" - force the column to be parsed as a time column. H2O will attempt to parse the following
list of date time formats: (date) "yyyy-MM-dd", "yyyy MM dd", "dd-MMM-yy", "dd MMM yy", (time)
"HH:mm:ss", "HH:mm:ss:SSS", "HH:mm:ss:SSSnnnnnn", "HH.mm.ss", "HH.mm.ss.SSS", "HH.mm.ss.SSSnnnnnn".
Times can also contain "AM" or "PM".
:param na_strings: A list of strings, or a list of lists of strings (one list per column), or a dictionary
of column names to strings which are to be interpreted as missing values.
:param skipped_columns: a list of integer column indices to skip; these columns are not parsed into the final frame from the imported file.
:param quotechar: A hint for the parser which character to expect as quoting character. Only single quote, double quote or None (default) are allowed. None means automatic detection.
:param escapechar: (Optional) One ASCII character used to escape other characters.
:returns: a new :class:`H2OFrame` instance.
:examples:
>>> iris_df = h2o.upload_file("~/Desktop/repos/h2o-3/smalldata/iris/iris.csv")
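The header, sep and na_strings semantics described above can be illustrated in plain Python (this is only a sketch of the behavior, not H2O's actual parser; the tiny CSV below is made up):

```python
# Illustration of upload_file's header / sep / na_strings semantics:
# header=1 means the first line holds column names, sep is the field
# separator, and any cell matching an na_strings entry becomes missing.
import csv
import io

data = "a;b;c\n1;x;NA\n2;?;3\n"
na_strings = ["NA", "?"]

rows = list(csv.reader(io.StringIO(data), delimiter=";"))
col_names = rows[0]                      # header=1: first line is the header
parsed = [[None if cell in na_strings else cell for cell in row]
          for row in rows[1:]]

print(col_names)   # ['a', 'b', 'c']
print(parsed)      # [['1', 'x', None], ['2', None, '3']]
```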
upload_model(path)
Upload a binary model from the provided local path to the H2O cluster.
(An H2O model can be saved in binary form either by the save_model() or the download_model() function.)
:param path: A path on the machine this python session is currently connected to, specifying the location of the model to upload.
:returns: a new :class:`H2OEstimator` object.
upload_mojo(mojo_path)
Uploads an existing MOJO model from local filesystem into H2O and imports it as an H2O Generic Model.
:param mojo_path: Path to the MOJO archive on the user's local filesystem
:return: An H2OGenericEstimator instance embedding given MOJO
:examples:
>>> import tempfile
>>> from h2o.estimators import H2OGradientBoostingEstimator
>>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
>>> model = H2OGradientBoostingEstimator(ntrees = 1)
>>> model.train(x = ["Origin", "Dest"],
... y = "IsDepDelayed",
... training_frame=airlines)
>>> mojo_dir = tempfile.mkdtemp()
>>> original_model_filename = model.download_mojo(mojo_dir)
>>> mojo_model = h2o.upload_mojo(original_model_filename)
varimp_heatmap(models, top_n=None, figsize=(16, 9), cluster=True, colormap='RdYlBu_r')
Variable Importance Heatmap across a group of models
Variable importance heatmap shows variable importance across multiple models.
Some models in H2O return variable importance for one-hot (binary indicator)
encoded versions of categorical columns (e.g. Deep Learning, XGBoost). In order
for the variable importance of categorical columns to be compared across all model
types, we compute a summarization of the variable importance across all one-hot
encoded features and return a single variable importance for the original categorical
feature. By default, the models and variables are ordered by their similarity.
:param models: a list of H2O models, an H2O AutoML instance, or an H2OFrame with a 'model_id' column (e.g. H2OAutoML leaderboard)
:param top_n: DEPRECATED. Use only the top n models (applies only when used with H2OAutoML).
:param figsize: figure size; passed directly to matplotlib
:param cluster: if True, cluster the models and variables
:param colormap: colormap to use
:returns: a matplotlib figure object
:examples:
>>> import h2o
>>> from h2o.automl import H2OAutoML
>>>
>>> h2o.init()
>>>
>>> # Import the wine dataset into H2O:
>>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
>>> df = h2o.import_file(f)
>>>
>>> # Set the response
>>> response = "quality"
>>>
>>> # Split the dataset into a train and test set:
>>> train, test = df.split_frame([0.8])
>>>
>>> # Train an H2OAutoML
>>> aml = H2OAutoML(max_models=10)
>>> aml.train(y=response, training_frame=train)
>>>
>>> # Create the variable importance heatmap
>>> aml.varimp_heatmap()
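The one-hot aggregation step described above can be sketched in a few lines: importances that a model reports for one-hot encoded features (e.g. "Origin.SFO") are summed back into a single importance for the original categorical column. This mirrors the idea only; H2O's internal implementation and naming scheme may differ, and the numbers below are made up.

```python
# Sum one-hot feature importances back to their original categorical column,
# assuming the "column.level" naming convention.
def aggregate_onehot(importances):
    out = {}
    for name, imp in importances.items():
        base = name.split(".", 1)[0]    # "Origin.SFO" -> "Origin"
        out[base] = out.get(base, 0.0) + imp
    return out

imps = {"Origin.SFO": 0.2, "Origin.JFK": 0.1, "Distance": 0.4}
agg = aggregate_onehot(imps)
print(sorted(agg))          # ['Distance', 'Origin']
```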
DATA
__all__ = ['connect', 'init', 'api', 'connection', 'resume', 'upload_f...
__buildinfo__ = "versionFromGradle='3.34.0',projectVersion='3.34....il...
VERSION
3.34.0.3
FILE
c:\users\admin\anaconda3\lib\site-packages\h2o\__init__.py
Help on class H2ODeepLearningEstimator in module h2o.estimators.deeplearning:
class H2ODeepLearningEstimator(h2o.estimators.estimator_base.H2OEstimator)
| H2ODeepLearningEstimator(model_id=None, training_frame=None, validation_frame=None, nfolds=0, keep_cross_validation_models=True, keep_cross_validation_predictions=False, keep_cross_validation_fold_assignment=False, fold_assignment='auto', fold_column=None, response_column=None, ignored_columns=None, ignore_const_cols=True, score_each_iteration=False, weights_column=None, offset_column=None, balance_classes=False, class_sampling_factors=None, max_after_balance_size=5.0, max_confusion_matrix_size=20, checkpoint=None, pretrained_autoencoder=None, overwrite_with_best_model=True, use_all_factor_levels=True, standardize=True, activation='rectifier', hidden=[200, 200], epochs=10.0, train_samples_per_iteration=-2, target_ratio_comm_to_comp=0.05, seed=-1, adaptive_rate=True, rho=0.99, epsilon=1e-08, rate=0.005, rate_annealing=1e-06, rate_decay=1.0, momentum_start=0.0, momentum_ramp=1000000.0, momentum_stable=0.0, nesterov_accelerated_gradient=True, input_dropout_ratio=0.0, hidden_dropout_ratios=None, l1=0.0, l2=0.0, max_w2=3.4028235e+38, initial_weight_distribution='uniform_adaptive', initial_weight_scale=1.0, initial_weights=None, initial_biases=None, loss='automatic', distribution='auto', quantile_alpha=0.5, tweedie_power=1.5, huber_alpha=0.9, score_interval=5.0, score_training_samples=10000, score_validation_samples=0, score_duty_cycle=0.1, classification_stop=0.0, regression_stop=1e-06, stopping_rounds=5, stopping_metric='auto', stopping_tolerance=0.0, max_runtime_secs=0.0, score_validation_sampling='uniform', diagnostics=True, fast_mode=True, force_load_balance=True, variable_importances=True, replicate_training_data=True, single_node_mode=False, shuffle_training_data=False, missing_values_handling='mean_imputation', quiet_mode=False, autoencoder=False, sparse=False, col_major=False, average_activation=0.0, sparsity_beta=0.0, max_categorical_features=2147483647, reproducible=False, export_weights_and_biases=False, mini_batch_size=1, categorical_encoding='auto', 
elastic_averaging=False, elastic_averaging_moving_rate=0.9, elastic_averaging_regularization=0.001, export_checkpoints_dir=None, auc_type='auto')
|
| Deep Learning
|
| Build a Deep Neural Network model using CPUs
| Builds a feed-forward multilayer artificial neural network on an H2OFrame
|
| :examples:
|
| >>> from h2o.estimators.deeplearning import H2ODeepLearningEstimator
| >>> rows = [[1,2,3,4,0], [2,1,2,4,1], [2,1,4,2,1],
| ... [0,1,2,34,1], [2,3,4,1,0]] * 50
| >>> fr = h2o.H2OFrame(rows)
| >>> fr[4] = fr[4].asfactor()
| >>> model = H2ODeepLearningEstimator()
| >>> model.train(x=range(4), y=4, training_frame=fr)
| >>> model.logloss()
|
| Method resolution order:
| H2ODeepLearningEstimator
| h2o.estimators.estimator_base.H2OEstimator
| h2o.model.model_base.ModelBase
| h2o.base.Keyed
| builtins.object
|
| Methods defined here:
|
| __init__(self, model_id=None, training_frame=None, validation_frame=None, nfolds=0, keep_cross_validation_models=True, keep_cross_validation_predictions=False, keep_cross_validation_fold_assignment=False, fold_assignment='auto', fold_column=None, response_column=None, ignored_columns=None, ignore_const_cols=True, score_each_iteration=False, weights_column=None, offset_column=None, balance_classes=False, class_sampling_factors=None, max_after_balance_size=5.0, max_confusion_matrix_size=20, checkpoint=None, pretrained_autoencoder=None, overwrite_with_best_model=True, use_all_factor_levels=True, standardize=True, activation='rectifier', hidden=[200, 200], epochs=10.0, train_samples_per_iteration=-2, target_ratio_comm_to_comp=0.05, seed=-1, adaptive_rate=True, rho=0.99, epsilon=1e-08, rate=0.005, rate_annealing=1e-06, rate_decay=1.0, momentum_start=0.0, momentum_ramp=1000000.0, momentum_stable=0.0, nesterov_accelerated_gradient=True, input_dropout_ratio=0.0, hidden_dropout_ratios=None, l1=0.0, l2=0.0, max_w2=3.4028235e+38, initial_weight_distribution='uniform_adaptive', initial_weight_scale=1.0, initial_weights=None, initial_biases=None, loss='automatic', distribution='auto', quantile_alpha=0.5, tweedie_power=1.5, huber_alpha=0.9, score_interval=5.0, score_training_samples=10000, score_validation_samples=0, score_duty_cycle=0.1, classification_stop=0.0, regression_stop=1e-06, stopping_rounds=5, stopping_metric='auto', stopping_tolerance=0.0, max_runtime_secs=0.0, score_validation_sampling='uniform', diagnostics=True, fast_mode=True, force_load_balance=True, variable_importances=True, replicate_training_data=True, single_node_mode=False, shuffle_training_data=False, missing_values_handling='mean_imputation', quiet_mode=False, autoencoder=False, sparse=False, col_major=False, average_activation=0.0, sparsity_beta=0.0, max_categorical_features=2147483647, reproducible=False, export_weights_and_biases=False, mini_batch_size=1, categorical_encoding='auto', 
elastic_averaging=False, elastic_averaging_moving_rate=0.9, elastic_averaging_regularization=0.001, export_checkpoints_dir=None, auc_type='auto')
| :param model_id: Destination id for this model; auto-generated if not specified.
| Defaults to ``None``.
| :type model_id: Union[None, str, H2OEstimator], optional
| :param training_frame: Id of the training data frame.
| Defaults to ``None``.
| :type training_frame: Union[None, str, H2OFrame], optional
| :param validation_frame: Id of the validation data frame.
| Defaults to ``None``.
| :type validation_frame: Union[None, str, H2OFrame], optional
| :param nfolds: Number of folds for K-fold cross-validation (0 to disable or >= 2).
| Defaults to ``0``.
| :type nfolds: int
| :param keep_cross_validation_models: Whether to keep the cross-validation models.
| Defaults to ``True``.
| :type keep_cross_validation_models: bool
| :param keep_cross_validation_predictions: Whether to keep the predictions of the cross-validation models.
| Defaults to ``False``.
| :type keep_cross_validation_predictions: bool
| :param keep_cross_validation_fold_assignment: Whether to keep the cross-validation fold assignment.
| Defaults to ``False``.
| :type keep_cross_validation_fold_assignment: bool
| :param fold_assignment: Cross-validation fold assignment scheme, if fold_column is not specified. The
| 'Stratified' option will stratify the folds based on the response variable, for classification problems.
| Defaults to ``"auto"``.
| :type fold_assignment: Literal["auto", "random", "modulo", "stratified"]
| :param fold_column: Column with cross-validation fold index assignment per observation.
| Defaults to ``None``.
| :type fold_column: str, optional
| :param response_column: Response variable column.
| Defaults to ``None``.
| :type response_column: str, optional
| :param ignored_columns: Names of columns to ignore for training.
| Defaults to ``None``.
| :type ignored_columns: List[str], optional
| :param ignore_const_cols: Ignore constant columns.
| Defaults to ``True``.
| :type ignore_const_cols: bool
| :param score_each_iteration: Whether to score during each iteration of model training.
| Defaults to ``False``.
| :type score_each_iteration: bool
| :param weights_column: Column with observation weights. Giving some observation a weight of zero is equivalent
| to excluding it from the dataset; giving an observation a relative weight of 2 is equivalent to repeating
| that row twice. Negative weights are not allowed. Note: Weights are per-row observation weights and do
| not increase the size of the data frame. This is typically the number of times a row is repeated, but
| non-integer values are supported as well. During training, rows with higher weights matter more, due to
| the larger loss function pre-factor. If you set weight = 0 for a row, the returned prediction frame at
| that row is zero and this is incorrect. To get an accurate prediction, remove all rows with weight == 0.
| Defaults to ``None``.
| :type weights_column: str, optional
| :param offset_column: Offset column. This will be added to the combination of columns before applying the link
| function.
| Defaults to ``None``.
| :type offset_column: str, optional
| :param balance_classes: Balance training data class counts via over/under-sampling (for imbalanced data).
| Defaults to ``False``.
| :type balance_classes: bool
| :param class_sampling_factors: Desired over/under-sampling ratios per class (in lexicographic order). If not
| specified, sampling factors will be automatically computed to obtain class balance during training.
| Requires balance_classes.
| Defaults to ``None``.
| :type class_sampling_factors: List[float], optional
| :param max_after_balance_size: Maximum relative size of the training data after balancing class counts (can be
| less than 1.0). Requires balance_classes.
| Defaults to ``5.0``.
| :type max_after_balance_size: float
| :param max_confusion_matrix_size: [Deprecated] Maximum size (# classes) for confusion matrices to be printed in
| the Logs.
| Defaults to ``20``.
| :type max_confusion_matrix_size: int
| :param checkpoint: Model checkpoint to resume training with.
| Defaults to ``None``.
| :type checkpoint: Union[None, str, H2OEstimator], optional
| :param pretrained_autoencoder: Pretrained autoencoder model to initialize this model with.
| Defaults to ``None``.
| :type pretrained_autoencoder: Union[None, str, H2OEstimator], optional
| :param overwrite_with_best_model: If enabled, override the final model with the best model found during
| training.
| Defaults to ``True``.
| :type overwrite_with_best_model: bool
| :param use_all_factor_levels: Use all factor levels of categorical variables. Otherwise, the first factor level
| is omitted (without loss of accuracy). Useful for variable importances and auto-enabled for autoencoder.
| Defaults to ``True``.
| :type use_all_factor_levels: bool
| :param standardize: If enabled, automatically standardize the data. If disabled, the user must provide properly
| scaled input data.
| Defaults to ``True``.
| :type standardize: bool
| :param activation: Activation function.
| Defaults to ``"rectifier"``.
| :type activation: Literal["tanh", "tanh_with_dropout", "rectifier", "rectifier_with_dropout", "maxout",
| "maxout_with_dropout"]
| :param hidden: Hidden layer sizes (e.g. [100, 100]).
| Defaults to ``[200, 200]``.
| :type hidden: List[int]
| :param epochs: How many times the dataset should be iterated (streamed), can be fractional.
| Defaults to ``10.0``.
| :type epochs: float
| :param train_samples_per_iteration: Number of training samples (globally) per MapReduce iteration. Special
| values are 0: one epoch, -1: all available data (e.g., replicated training data), -2: automatic.
| Defaults to ``-2``.
| :type train_samples_per_iteration: int
| :param target_ratio_comm_to_comp: Target ratio of communication overhead to computation. Only for multi-node
| operation and train_samples_per_iteration = -2 (auto-tuning).
| Defaults to ``0.05``.
| :type target_ratio_comm_to_comp: float
| :param seed: Seed for random numbers (affects sampling) - Note: only reproducible when running single threaded.
| Defaults to ``-1``.
| :type seed: int
| :param adaptive_rate: Adaptive learning rate.
| Defaults to ``True``.
| :type adaptive_rate: bool
| :param rho: Adaptive learning rate time decay factor (similarity to prior updates).
| Defaults to ``0.99``.
| :type rho: float
| :param epsilon: Adaptive learning rate smoothing factor (to avoid divisions by zero and allow progress).
| Defaults to ``1e-08``.
| :type epsilon: float
| :param rate: Learning rate (higher => less stable, lower => slower convergence).
| Defaults to ``0.005``.
| :type rate: float
| :param rate_annealing: Learning rate annealing: rate / (1 + rate_annealing * samples).
| Defaults to ``1e-06``.
| :type rate_annealing: float
| :param rate_decay: Learning rate decay factor between layers (N-th layer: rate * rate_decay ^ (n - 1)).
| Defaults to ``1.0``.
| :type rate_decay: float
| :param momentum_start: Initial momentum at the beginning of training (try 0.5).
| Defaults to ``0.0``.
| :type momentum_start: float
| :param momentum_ramp: Number of training samples for which momentum increases.
| Defaults to ``1000000.0``.
| :type momentum_ramp: float
| :param momentum_stable: Final momentum after the ramp is over (try 0.99).
| Defaults to ``0.0``.
| :type momentum_stable: float
| :param nesterov_accelerated_gradient: Use Nesterov accelerated gradient (recommended).
| Defaults to ``True``.
| :type nesterov_accelerated_gradient: bool
| :param input_dropout_ratio: Input layer dropout ratio (can improve generalization, try 0.1 or 0.2).
| Defaults to ``0.0``.
| :type input_dropout_ratio: float
| :param hidden_dropout_ratios: Hidden layer dropout ratios (can improve generalization), specify one value per
| hidden layer, defaults to 0.5.
| Defaults to ``None``.
| :type hidden_dropout_ratios: List[float], optional
| :param l1: L1 regularization (can add stability and improve generalization, causes many weights to become 0).
| Defaults to ``0.0``.
| :type l1: float
| :param l2: L2 regularization (can add stability and improve generalization, causes many weights to be small).
| Defaults to ``0.0``.
| :type l2: float
| :param max_w2: Constraint for squared sum of incoming weights per unit (e.g. for Rectifier).
| Defaults to ``3.4028235e+38``.
| :type max_w2: float
| :param initial_weight_distribution: Initial weight distribution.
| Defaults to ``"uniform_adaptive"``.
| :type initial_weight_distribution: Literal["uniform_adaptive", "uniform", "normal"]
| :param initial_weight_scale: Uniform: -value...value, Normal: stddev.
| Defaults to ``1.0``.
| :type initial_weight_scale: float
| :param initial_weights: A list of H2OFrame ids to initialize the weight matrices of this model with.
| Defaults to ``None``.
| :type initial_weights: List[Union[None, str, H2OFrame]], optional
| :param initial_biases: A list of H2OFrame ids to initialize the bias vectors of this model with.
| Defaults to ``None``.
| :type initial_biases: List[Union[None, str, H2OFrame]], optional
| :param loss: Loss function.
| Defaults to ``"automatic"``.
| :type loss: Literal["automatic", "cross_entropy", "quadratic", "huber", "absolute", "quantile"]
| :param distribution: Distribution function
| Defaults to ``"auto"``.
| :type distribution: Literal["auto", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace",
| "quantile", "huber"]
| :param quantile_alpha: Desired quantile for Quantile regression, must be between 0 and 1.
| Defaults to ``0.5``.
| :type quantile_alpha: float
| :param tweedie_power: Tweedie power for Tweedie regression, must be between 1 and 2.
| Defaults to ``1.5``.
| :type tweedie_power: float
| :param huber_alpha: Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must
| be between 0 and 1).
| Defaults to ``0.9``.
| :type huber_alpha: float
| :param score_interval: Shortest time interval (in seconds) between model scoring.
| Defaults to ``5.0``.
| :type score_interval: float
| :param score_training_samples: Number of training set samples for scoring (0 for all).
| Defaults to ``10000``.
| :type score_training_samples: int
| :param score_validation_samples: Number of validation set samples for scoring (0 for all).
| Defaults to ``0``.
| :type score_validation_samples: int
| :param score_duty_cycle: Maximum duty cycle fraction for scoring (lower: more training, higher: more scoring).
| Defaults to ``0.1``.
| :type score_duty_cycle: float
| :param classification_stop: Stopping criterion for classification error fraction on training data (-1 to
| disable).
| Defaults to ``0.0``.
| :type classification_stop: float
| :param regression_stop: Stopping criterion for regression error (MSE) on training data (-1 to disable).
| Defaults to ``1e-06``.
| :type regression_stop: float
| :param stopping_rounds: Early stopping based on convergence of stopping_metric. Stop if simple moving average of
| length k of the stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
| Defaults to ``5``.
| :type stopping_rounds: int
| :param stopping_metric: Metric to use for early stopping (AUTO: logloss for classification, deviance for
| regression and anomaly_score for Isolation Forest). Note that custom and custom_increasing can only be
| used in GBM and DRF with the Python client.
| Defaults to ``"auto"``.
| :type stopping_metric: Literal["auto", "deviance", "logloss", "mse", "rmse", "mae", "rmsle", "auc", "aucpr", "lift_top_group",
| "misclassification", "mean_per_class_error", "custom", "custom_increasing"]
| :param stopping_tolerance: Relative tolerance for metric-based stopping criterion (stop if relative improvement
| is not at least this much)
| Defaults to ``0.0``.
| :type stopping_tolerance: float
| :param max_runtime_secs: Maximum allowed runtime in seconds for model training. Use 0 to disable.
| Defaults to ``0.0``.
| :type max_runtime_secs: float
| :param score_validation_sampling: Method used to sample validation dataset for scoring.
| Defaults to ``"uniform"``.
| :type score_validation_sampling: Literal["uniform", "stratified"]
| :param diagnostics: Enable diagnostics for hidden layers.
| Defaults to ``True``.
| :type diagnostics: bool
| :param fast_mode: Enable fast mode (minor approximation in back-propagation).
| Defaults to ``True``.
| :type fast_mode: bool
| :param force_load_balance: Force extra load balancing to increase training speed for small datasets (to keep all
| cores busy).
| Defaults to ``True``.
| :type force_load_balance: bool
| :param variable_importances: Compute variable importances for input features (Gedeon method) - can be slow for
| large networks.
| Defaults to ``True``.
| :type variable_importances: bool
| :param replicate_training_data: Replicate the entire training dataset onto every node for faster training on
| small datasets.
| Defaults to ``True``.
| :type replicate_training_data: bool
| :param single_node_mode: Run on a single node for fine-tuning of model parameters.
| Defaults to ``False``.
| :type single_node_mode: bool
| :param shuffle_training_data: Enable shuffling of training data (recommended if training data is replicated and
| train_samples_per_iteration is close to #nodes x #rows, or if using balance_classes).
| Defaults to ``False``.
| :type shuffle_training_data: bool
| :param missing_values_handling: Handling of missing values. Either MeanImputation or Skip.
| Defaults to ``"mean_imputation"``.
| :type missing_values_handling: Literal["mean_imputation", "skip"]
| :param quiet_mode: Enable quiet mode for less output to standard output.
| Defaults to ``False``.
| :type quiet_mode: bool
| :param autoencoder: Auto-Encoder.
| Defaults to ``False``.
| :type autoencoder: bool
| :param sparse: Sparse data handling (more efficient for data with lots of 0 values).
| Defaults to ``False``.
| :type sparse: bool
| :param col_major: #DEPRECATED Use a column major weight matrix for input layer. Can speed up forward
| propagation, but might slow down backpropagation.
| Defaults to ``False``.
| :type col_major: bool
| :param average_activation: Average activation for sparse auto-encoder. #Experimental
| Defaults to ``0.0``.
| :type average_activation: float
| :param sparsity_beta: Sparsity regularization. #Experimental
| Defaults to ``0.0``.
| :type sparsity_beta: float
| :param max_categorical_features: Max. number of categorical features, enforced via hashing. #Experimental
| Defaults to ``2147483647``.
| :type max_categorical_features: int
| :param reproducible: Force reproducibility on small data (will be slow - only uses 1 thread).
| Defaults to ``False``.
| :type reproducible: bool
| :param export_weights_and_biases: Whether to export Neural Network weights and biases to H2O Frames.
| Defaults to ``False``.
| :type export_weights_and_biases: bool
| :param mini_batch_size: Mini-batch size (smaller leads to better fit, larger can speed up and generalize
| better).
| Defaults to ``1``.
| :type mini_batch_size: int
| :param categorical_encoding: Encoding scheme for categorical features
| Defaults to ``"auto"``.
| :type categorical_encoding: Literal["auto", "enum", "one_hot_internal", "one_hot_explicit", "binary", "eigen", "label_encoder",
| "sort_by_response", "enum_limited"]
| :param elastic_averaging: Elastic averaging between compute nodes can improve distributed model convergence.
| #Experimental
| Defaults to ``False``.
| :type elastic_averaging: bool
| :param elastic_averaging_moving_rate: Elastic averaging moving rate (only if elastic averaging is enabled).
| Defaults to ``0.9``.
| :type elastic_averaging_moving_rate: float
| :param elastic_averaging_regularization: Elastic averaging regularization strength (only if elastic averaging is
| enabled).
| Defaults to ``0.001``.
| :type elastic_averaging_regularization: float
| :param export_checkpoints_dir: Automatically export generated models to this directory.
| Defaults to ``None``.
| :type export_checkpoints_dir: str, optional
| :param auc_type: Set default multinomial AUC type.
| Defaults to ``"auto"``.
| :type auc_type: Literal["auto", "none", "macro_ovr", "weighted_ovr", "macro_ovo", "weighted_ovo"]
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| activation
| Activation function.
|
| Type: ``Literal["tanh", "tanh_with_dropout", "rectifier", "rectifier_with_dropout", "maxout",
| "maxout_with_dropout"]``, defaults to ``"rectifier"``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "cylinders"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(activation="tanh")
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.mse()
|
| adaptive_rate
| Adaptive learning rate.
|
| Type: ``bool``, defaults to ``True``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "cylinders"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(adaptive_rate=True)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.mse()
|
| auc_type
| Set default multinomial AUC type.
|
| Type: ``Literal["auto", "none", "macro_ovr", "weighted_ovr", "macro_ovo", "weighted_ovo"]``, defaults to
| ``"auto"``.
|
| autoencoder
| Auto-Encoder.
|
| Type: ``bool``, defaults to ``False``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(autoencoder=True)
| >>> cars_dl.train(x=predictors,
| ...               training_frame=train,
| ...               validation_frame=valid)
| >>> cars_dl.mse()
|
| average_activation
| Average activation for sparse auto-encoder. #Experimental
|
| Type: ``float``, defaults to ``0.0``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "cylinders"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(average_activation=1.5,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.mse()
|
| balance_classes
| Balance training data class counts via over/under-sampling (for imbalanced data).
|
| Type: ``bool``, defaults to ``False``.
|
| :examples:
|
| >>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
| >>> covtype[54] = covtype[54].asfactor()
| >>> predictors = covtype.columns[0:54]
| >>> response = 'C55'
| >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
| >>> cov_dl = H2ODeepLearningEstimator(balance_classes=True,
| ... seed=1234)
| >>> cov_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cov_dl.mse()
|
| categorical_encoding
| Encoding scheme for categorical features
|
| Type: ``Literal["auto", "enum", "one_hot_internal", "one_hot_explicit", "binary", "eigen", "label_encoder",
| "sort_by_response", "enum_limited"]``, defaults to ``"auto"``.
|
| :examples:
|
| >>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
| >>> airlines["Year"] = airlines["Year"].asfactor()
| >>> airlines["Month"] = airlines["Month"].asfactor()
| >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
| >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
| >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
| >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
| ... "DayOfWeek", "Month", "Distance", "FlightNum"]
| >>> response = "IsDepDelayed"
| >>> train, valid = airlines.split_frame(ratios=[.8], seed=1234)
| >>> encoding = "one_hot_internal"
| >>> airlines_dl = H2ODeepLearningEstimator(categorical_encoding=encoding,
| ... seed=1234)
| >>> airlines_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> airlines_dl.mse()
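As a rough illustration of what a one-hot scheme such as ``"one_hot_internal"`` does to a categorical column, here is a pure-Python sketch (a conceptual illustration, not H2O's internal implementation):

```python
def one_hot(column):
    """Expand a categorical column into 0/1 indicator columns, one per level."""
    levels = sorted(set(column))  # stable, deterministic level order
    return [[1 if v == lvl else 0 for lvl in levels] for v in column]

print(one_hot(["red", "blue", "red"]))  # [[0, 1], [1, 0], [0, 1]]
```

Other schemes trade width for information: ``"binary"`` packs levels into bits, ``"enum_limited"`` caps the number of levels, and ``"eigen"`` projects the indicators down to a single numeric column.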
|
| checkpoint
| Model checkpoint to resume training with.
|
| Type: ``Union[None, str, H2OEstimator]``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(activation="tanh",
| ...                                    seed=1234,
| ...                                    model_id="cars_dl")
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.mse()
| >>> cars_cont = H2ODeepLearningEstimator(checkpoint=cars_dl,
| ...                                      activation="tanh",
| ...                                      seed=1234)
| >>> cars_cont.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_cont.mse()
|
| class_sampling_factors
| Desired over/under-sampling ratios per class (in lexicographic order). If not specified, sampling factors will
| be automatically computed to obtain class balance during training. Requires balance_classes.
|
| Type: ``List[float]``.
|
| :examples:
|
| >>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
| >>> covtype[54] = covtype[54].asfactor()
| >>> predictors = covtype.columns[0:54]
| >>> response = 'C55'
| >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
| >>> sample_factors = [1., 0.5, 1., 1., 1., 1., 1.]
| >>> cov_dl = H2ODeepLearningEstimator(balance_classes=True,
| ... class_sampling_factors=sample_factors,
| ... seed=1234)
| >>> cov_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cov_dl.mse()
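Conceptually, each entry of ``class_sampling_factors`` scales the row count of one class (classes taken in lexicographic order). A pure-Python sketch of the resulting per-class counts (not H2O's implementation):

```python
from collections import Counter

def resample_counts(labels, factors):
    """Scale each class's count by its sampling factor.

    Factors are matched to classes in lexicographic order, as in
    class_sampling_factors.
    """
    counts = Counter(labels)
    classes = sorted(counts)
    return {c: round(counts[c] * f) for c, f in zip(classes, factors)}

print(resample_counts(["a", "a", "a", "a", "b", "b"], [0.5, 2.0]))
# {'a': 2, 'b': 4} -- class "a" under-sampled, class "b" over-sampled
```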
|
| classification_stop
| Stopping criterion for classification error fraction on training data (-1 to disable).
|
| Type: ``float``, defaults to ``0.0``.
|
| :examples:
|
| >>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
| >>> covtype[54] = covtype[54].asfactor()
| >>> predictors = covtype.columns[0:54]
| >>> response = 'C55'
| >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
| >>> cov_dl = H2ODeepLearningEstimator(classification_stop=1.5,
| ... seed=1234)
| >>> cov_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cov_dl.mse()
|
| col_major
| [Deprecated] Use a column-major weight matrix for the input layer. Can speed up forward propagation, but
| might slow down backpropagation.
|
| Type: ``bool``, defaults to ``False``.
|
| diagnostics
| Enable diagnostics for hidden layers.
|
| Type: ``bool``, defaults to ``True``.
|
| :examples:
|
| >>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
| >>> covtype[54] = covtype[54].asfactor()
| >>> predictors = covtype.columns[0:54]
| >>> response = 'C55'
| >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
| >>> cov_dl = H2ODeepLearningEstimator(diagnostics=True,
| ... seed=1234)
| >>> cov_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cov_dl.mse()
|
| distribution
| Distribution function.
|
| Type: ``Literal["auto", "bernoulli", "multinomial", "gaussian", "poisson", "gamma", "tweedie", "laplace",
| "quantile", "huber"]``, defaults to ``"auto"``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "cylinders"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(distribution="poisson",
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.mse()
|
| elastic_averaging
| Elastic averaging between compute nodes can improve distributed model convergence. #Experimental
|
| Type: ``bool``, defaults to ``False``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "cylinders"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(elastic_averaging=True,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.mse()
|
| elastic_averaging_moving_rate
| Elastic averaging moving rate (only if elastic averaging is enabled).
|
| Type: ``float``, defaults to ``0.9``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "cylinders"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(elastic_averaging_moving_rate=.8,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.mse()
|
| elastic_averaging_regularization
| Elastic averaging regularization strength (only if elastic averaging is enabled).
|
| Type: ``float``, defaults to ``0.001``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "cylinders"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(elastic_averaging_regularization=.008,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.mse()
|
| epochs
| How many times the dataset should be iterated (streamed), can be fractional.
|
| Type: ``float``, defaults to ``10.0``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "cylinders"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(epochs=15,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.mse()
|
| epsilon
| Adaptive learning rate smoothing factor (to avoid divisions by zero and allow progress).
|
| Type: ``float``, defaults to ``1e-08``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "cylinders"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(epsilon=1e-6,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.mse()
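H2O's adaptive learning rate follows the ADADELTA scheme, where ``epsilon`` keeps the ratio of accumulated statistics finite when the accumulators are near zero. A conceptual single-parameter sketch (not H2O's exact implementation):

```python
import math

def adadelta_step(grad, eg2, edx2, rho=0.95, eps=1e-8):
    """One ADADELTA update for a single weight.

    eg2/edx2 are decaying averages of squared gradients/updates;
    eps prevents division by zero and lets early steps make progress.
    """
    eg2 = rho * eg2 + (1 - rho) * grad ** 2
    dx = -math.sqrt(edx2 + eps) / math.sqrt(eg2 + eps) * grad
    edx2 = rho * edx2 + (1 - rho) * dx ** 2
    return dx, eg2, edx2

dx, eg2, edx2 = adadelta_step(grad=0.1, eg2=0.0, edx2=0.0)
```

With both accumulators at zero, the very first step is finite only because of ``eps``; a larger ``epsilon`` makes early updates more conservative.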
|
| export_checkpoints_dir
| Automatically export generated models to this directory.
|
| Type: ``str``.
|
| :examples:
|
| >>> import tempfile
| >>> from os import listdir
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "cylinders"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> checkpoints_dir = tempfile.mkdtemp()
| >>> cars_dl = H2ODeepLearningEstimator(export_checkpoints_dir=checkpoints_dir,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> len(listdir(checkpoints_dir))
|
| export_weights_and_biases
| Whether to export Neural Network weights and biases to H2O Frames.
|
| Type: ``bool``, defaults to ``False``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "cylinders"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(export_weights_and_biases=True,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.mse()
|
| fast_mode
| Enable fast mode (minor approximation in back-propagation).
|
| Type: ``bool``, defaults to ``True``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "cylinders"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(fast_mode=False,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.mse()
|
| fold_assignment
| Cross-validation fold assignment scheme, if fold_column is not specified. For classification problems, the
| 'Stratified' option stratifies the folds based on the response variable.
|
| Type: ``Literal["auto", "random", "modulo", "stratified"]``, defaults to ``"auto"``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "cylinders"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(fold_assignment="Random",
| ... nfolds=5,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.mse()
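The difference between the ``"random"`` and ``"modulo"`` schemes can be sketched in a few lines of pure Python (a conceptual illustration, not H2O's implementation):

```python
import random

def assign_folds(n_rows, nfolds, scheme, seed=1234):
    """Fold index per row: "modulo" cycles deterministically, "random" samples."""
    if scheme == "modulo":
        return [i % nfolds for i in range(n_rows)]
    rng = random.Random(seed)
    return [rng.randrange(nfolds) for _ in range(n_rows)]

print(assign_folds(6, 3, "modulo"))  # [0, 1, 2, 0, 1, 2]
```

``"modulo"`` gives perfectly even folds regardless of the seed, which helps reproducibility across runs; ``"random"`` folds vary in size but are unbiased.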
|
| fold_column
| Column with cross-validation fold index assignment per observation.
|
| Type: ``str``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "cylinders"
| >>> fold_numbers = cars.kfold_column(n_folds=5, seed=1234)
| >>> fold_numbers.set_names(["fold_numbers"])
| >>> cars = cars.cbind(fold_numbers)
| >>> print(cars['fold_numbers'])
| >>> cars_dl = H2ODeepLearningEstimator(seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=cars,
| ... fold_column="fold_numbers")
| >>> cars_dl.mse()
|
| force_load_balance
| Force extra load balancing to increase training speed for small datasets (to keep all cores busy).
|
| Type: ``bool``, defaults to ``True``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "cylinders"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(force_load_balance=False,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.mse()
|
| hidden
| Hidden layer sizes (e.g. [100, 100]).
|
| Type: ``List[int]``, defaults to ``[200, 200]``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "cylinders"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(hidden=[100,100],
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.mse()
|
| hidden_dropout_ratios
| Hidden layer dropout ratios (can improve generalization), specify one value per hidden layer, defaults to 0.5.
|
| Type: ``List[float]``.
|
| :examples:
|
| >>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
| >>> valid = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
| >>> features = list(range(0,784))
| >>> target = 784
| >>> train[target] = train[target].asfactor()
| >>> valid[target] = valid[target].asfactor()
| >>> model = H2ODeepLearningEstimator(epochs=20,
| ... hidden=[200,200],
| ... hidden_dropout_ratios=[0.5,0.5],
| ... seed=1234,
| ... activation='tanhwithdropout')
| >>> model.train(x=features,
| ... y=target,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> model.mse()
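What a per-layer dropout ratio does to one hidden layer's activations can be sketched as inverted dropout (a conceptual illustration, not H2O's implementation):

```python
import random

def dropout(activations, ratio, rng):
    """Zero each unit with probability `ratio`; scale survivors by 1/(1-ratio)
    so the layer's expected output is unchanged (inverted dropout)."""
    keep = 1.0 - ratio
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rng = random.Random(1234)
print(dropout([0.2, 0.7, 1.1, 0.4], ratio=0.5, rng=rng))
```

One ratio is applied per hidden layer, which is why ``hidden_dropout_ratios`` must have the same length as ``hidden`` and requires a ``...WithDropout`` activation.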
|
| huber_alpha
| Desired quantile for Huber/M-regression (threshold between quadratic and linear loss, must be between 0 and 1).
|
| Type: ``float``, defaults to ``0.9``.
|
| :examples:
|
| >>> insurance = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/glm_test/insurance.csv")
| >>> predictors = insurance.columns[0:4]
| >>> response = 'Claims'
| >>> insurance['Group'] = insurance['Group'].asfactor()
| >>> insurance['Age'] = insurance['Age'].asfactor()
| >>> train, valid = insurance.split_frame(ratios=[.8], seed=1234)
| >>> insurance_dl = H2ODeepLearningEstimator(distribution="huber",
| ... huber_alpha=0.9,
| ... seed=1234)
| >>> insurance_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> insurance_dl.mse()
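The Huber loss is quadratic for small residuals and linear beyond a threshold; ``huber_alpha`` sets that threshold as a quantile of the absolute residuals. A sketch of the loss itself for a given threshold ``delta`` (conceptual, not H2O's implementation):

```python
def huber_loss(residual, delta):
    """Quadratic for |r| <= delta, linear (slope delta) beyond it."""
    r = abs(residual)
    return 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)

print(huber_loss(0.5, 1.0))  # 0.125  (quadratic regime)
print(huber_loss(3.0, 1.0))  # 2.5    (linear regime, robust to outliers)
```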
|
| ignore_const_cols
| Ignore constant columns.
|
| Type: ``bool``, defaults to ``True``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year","const_1","const_2"]
| >>> response = "economy_20mpg"
| >>> cars["const_1"] = 6
| >>> cars["const_2"] = 7
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(seed=1234,
| ... ignore_const_cols=True)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.auc()
|
| ignored_columns
| Names of columns to ignore for training.
|
| Type: ``List[str]``.
|
| initial_biases
| A list of H2OFrame ids to initialize the bias vectors of this model with.
|
| Type: ``List[Union[None, str, H2OFrame]]``.
|
| :examples:
|
| >>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv")
| >>> dl1 = H2ODeepLearningEstimator(hidden=[10,10],
| ... export_weights_and_biases=True)
| >>> dl1.train(x=list(range(4)), y=4, training_frame=iris)
| >>> ll1 = dl1.model_performance(iris).logloss()
| >>> p1 = dl1.predict(iris)
| >>> print(ll1)
| >>> w1 = dl1.weights(0)
| >>> w2 = dl1.weights(1)
| >>> w3 = dl1.weights(2)
| >>> b1 = dl1.biases(0)
| >>> b2 = dl1.biases(1)
| >>> b3 = dl1.biases(2)
| >>> dl2 = H2ODeepLearningEstimator(hidden=[10,10],
| ... initial_weights=[w1, w2, w3],
| ... initial_biases=[b1, b2, b3],
| ... epochs=0)
| >>> dl2.train(x=list(range(4)), y=4, training_frame=iris)
| >>> dl2.initial_biases
|
| initial_weight_distribution
| Initial weight distribution.
|
| Type: ``Literal["uniform_adaptive", "uniform", "normal"]``, defaults to ``"uniform_adaptive"``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(initial_weight_distribution="Uniform",
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.auc()
|
| initial_weight_scale
| Uniform: -value...value, Normal: stddev.
|
| Type: ``float``, defaults to ``1.0``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(initial_weight_scale=1.5,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.auc()
|
| initial_weights
| A list of H2OFrame ids to initialize the weight matrices of this model with.
|
| Type: ``List[Union[None, str, H2OFrame]]``.
|
| :examples:
|
| >>> iris = h2o.import_file("http://h2o-public-test-data.s3.amazonaws.com/smalldata/iris/iris.csv")
| >>> dl1 = H2ODeepLearningEstimator(hidden=[10,10],
| ... export_weights_and_biases=True)
| >>> dl1.train(x=list(range(4)), y=4, training_frame=iris)
| >>> ll1 = dl1.model_performance(iris).logloss()
| >>> p1 = dl1.predict(iris)
| >>> print(ll1)
| >>> w1 = dl1.weights(0)
| >>> w2 = dl1.weights(1)
| >>> w3 = dl1.weights(2)
| >>> b1 = dl1.biases(0)
| >>> b2 = dl1.biases(1)
| >>> b3 = dl1.biases(2)
| >>> dl2 = H2ODeepLearningEstimator(hidden=[10,10],
| ... initial_weights=[w1, w2, w3],
| ... initial_biases=[b1, b2, b3],
| ... epochs=0)
| >>> dl2.train(x=list(range(4)), y=4, training_frame=iris)
| >>> dl2.initial_weights
|
| input_dropout_ratio
| Input layer dropout ratio (can improve generalization, try 0.1 or 0.2).
|
| Type: ``float``, defaults to ``0.0``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(input_dropout_ratio=0.2,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.auc()
|
| keep_cross_validation_fold_assignment
| Whether to keep the cross-validation fold assignment.
|
| Type: ``bool``, defaults to ``False``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> cars_dl = H2ODeepLearningEstimator(keep_cross_validation_fold_assignment=True,
| ... nfolds=5,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=cars)
| >>> print(cars_dl.cross_validation_fold_assignment())
|
| keep_cross_validation_models
| Whether to keep the cross-validation models.
|
| Type: ``bool``, defaults to ``True``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> cars_dl = H2ODeepLearningEstimator(keep_cross_validation_models=True,
| ... nfolds=5,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=cars)
| >>> print(cars_dl.cross_validation_models())
|
| keep_cross_validation_predictions
| Whether to keep the predictions of the cross-validation models.
|
| Type: ``bool``, defaults to ``False``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> cars_dl = H2ODeepLearningEstimator(keep_cross_validation_predictions=True,
| ... nfolds=5,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=cars)
| >>> print(cars_dl.cross_validation_predictions())
|
| l1
| L1 regularization (can add stability and improve generalization, causes many weights to become 0).
|
| Type: ``float``, defaults to ``0.0``.
|
| :examples:
|
| >>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
| >>> covtype[54] = covtype[54].asfactor()
| >>> hh_imbalanced = H2ODeepLearningEstimator(l1=1e-5,
| ... activation="Rectifier",
| ... loss="CrossEntropy",
| ... hidden=[200,200],
| ... epochs=1,
| ... balance_classes=False,
| ... reproducible=True,
| ... seed=1234)
| >>> hh_imbalanced.train(x=list(range(54)),y=54, training_frame=covtype)
| >>> hh_imbalanced.mse()
|
| l2
| L2 regularization (can add stability and improve generalization, causes many weights to be small).
|
| Type: ``float``, defaults to ``0.0``.
|
| :examples:
|
| >>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
| >>> covtype[54] = covtype[54].asfactor()
| >>> hh_imbalanced = H2ODeepLearningEstimator(l2=1e-5,
| ... activation="Rectifier",
| ... loss="CrossEntropy",
| ... hidden=[200,200],
| ... epochs=1,
| ... balance_classes=False,
| ... reproducible=True,
| ... seed=1234)
| >>> hh_imbalanced.train(x=list(range(54)),y=54, training_frame=covtype)
| >>> hh_imbalanced.mse()
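Both penalties add a regularization term to the training loss: the L1 term pushes weights to exactly zero, the L2 term shrinks them smoothly. A conceptual sketch of the combined penalty (not H2O's implementation):

```python
def penalty(weights, l1, l2):
    """Regularization term added to the loss: l1*sum|w| + (l2/2)*sum(w^2)."""
    return l1 * sum(abs(w) for w in weights) + 0.5 * l2 * sum(w * w for w in weights)

print(penalty([0.5, -1.0, 2.0], l1=1e-5, l2=1e-5))
```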
|
| loss
| Loss function.
|
| Type: ``Literal["automatic", "cross_entropy", "quadratic", "huber", "absolute", "quantile"]``, defaults to
| ``"automatic"``.
|
| :examples:
|
| >>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
| >>> covtype[54] = covtype[54].asfactor()
| >>> hh_imbalanced = H2ODeepLearningEstimator(l1=1e-5,
| ... activation="Rectifier",
| ... loss="CrossEntropy",
| ... hidden=[200,200],
| ... epochs=1,
| ... balance_classes=False,
| ... reproducible=True,
| ... seed=1234)
| >>> hh_imbalanced.train(x=list(range(54)),y=54, training_frame=covtype)
| >>> hh_imbalanced.mse()
|
| max_after_balance_size
| Maximum relative size of the training data after balancing class counts (can be less than 1.0). Requires
| balance_classes.
|
| Type: ``float``, defaults to ``5.0``.
|
| :examples:
|
| >>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
| >>> covtype[54] = covtype[54].asfactor()
| >>> predictors = covtype.columns[0:54]
| >>> response = 'C55'
| >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
| >>> max_size = .85
| >>> cov_dl = H2ODeepLearningEstimator(balance_classes=True,
| ...                                   max_after_balance_size=max_size,
| ...                                   seed=1234)
| >>> cov_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cov_dl.logloss()
|
| max_categorical_features
| Max. number of categorical features, enforced via hashing. #Experimental
|
| Type: ``int``, defaults to ``2147483647``.
|
| :examples:
|
| >>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
| >>> covtype[54] = covtype[54].asfactor()
| >>> predictors = covtype.columns[0:54]
| >>> response = 'C55'
| >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
| >>> cov_dl = H2ODeepLearningEstimator(balance_classes=True,
| ... max_categorical_features=2147483647,
| ... seed=1234)
| >>> cov_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cov_dl.logloss()
|
| max_confusion_matrix_size
| [Deprecated] Maximum size (# classes) for confusion matrices to be printed in the Logs.
|
| Type: ``int``, defaults to ``20``.
|
| max_runtime_secs
| Maximum allowed runtime in seconds for model training. Use 0 to disable.
|
| Type: ``float``, defaults to ``0.0``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(max_runtime_secs=10,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.auc()
|
| max_w2
| Constraint for squared sum of incoming weights per unit (e.g. for Rectifier).
|
| Type: ``float``, defaults to ``3.4028235e+38``.
|
| :examples:
|
| >>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
| >>> covtype[54] = covtype[54].asfactor()
| >>> predictors = covtype.columns[0:54]
| >>> response = 'C55'
| >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
| >>> cov_dl = H2ODeepLearningEstimator(activation="RectifierWithDropout",
| ... hidden=[10,10],
| ... epochs=10,
| ... input_dropout_ratio=0.2,
| ... l1=1e-5,
| ... max_w2=10.5,
| ... stopping_rounds=0)
| >>> cov_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cov_dl.mse()
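The ``max_w2`` constraint can be pictured as rescaling a unit's incoming weight vector whenever its squared sum grows too large (a conceptual sketch, not H2O's implementation):

```python
import math

def constrain_max_w2(incoming, max_w2):
    """Rescale a unit's incoming weights if their squared sum exceeds max_w2."""
    ss = sum(w * w for w in incoming)
    if ss <= max_w2:
        return incoming
    scale = math.sqrt(max_w2 / ss)
    return [w * scale for w in incoming]

print(constrain_max_w2([3.0, 4.0], max_w2=10.5))  # squared sum 25 -> rescaled to 10.5
```

This keeps unbounded activations such as the Rectifier from letting individual units' weights blow up.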
|
| mini_batch_size
| Mini-batch size (smaller leads to better fit, larger can speed up and generalize better).
|
| Type: ``int``, defaults to ``1``.
|
| :examples:
|
| >>> covtype = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/covtype/covtype.20k.data")
| >>> covtype[54] = covtype[54].asfactor()
| >>> predictors = covtype.columns[0:54]
| >>> response = 'C55'
| >>> train, valid = covtype.split_frame(ratios=[.8], seed=1234)
| >>> cov_dl = H2ODeepLearningEstimator(activation="RectifierWithDropout",
| ...                                   hidden=[10,10],
| ...                                   epochs=10,
| ...                                   input_dropout_ratio=0.2,
| ...                                   l1=1e-5,
| ...                                   max_w2=10.5,
| ...                                   stopping_rounds=0,
| ...                                   mini_batch_size=35)
| >>> cov_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cov_dl.mse()
|
| missing_values_handling
| Handling of missing values. Either MeanImputation or Skip.
|
| Type: ``Literal["mean_imputation", "skip"]``, defaults to ``"mean_imputation"``.
|
| :examples:
|
| >>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
| >>> predictors = boston.columns[:-1]
| >>> response = "medv"
| >>> boston['chas'] = boston['chas'].asfactor()
| >>> boston.insert_missing_values()
| >>> train, valid = boston.split_frame(ratios=[.8])
| >>> boston_dl = H2ODeepLearningEstimator(missing_values_handling="skip")
| >>> boston_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> boston_dl.mse()
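The two strategies can be sketched on a single numeric column (a conceptual illustration, not H2O's implementation):

```python
def mean_impute(column):
    """"mean_imputation": replace missing entries (None) with the column mean.

    The "skip" alternative instead drops rows that contain missing values.
    """
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

print(mean_impute([1.0, None, 3.0]))  # [1.0, 2.0, 3.0]
```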
|
| momentum_ramp
| Number of training samples for which momentum increases.
|
| Type: ``float``, defaults to ``1000000.0``.
|
| :examples:
|
| >>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
| >>> predictors = ["Year","Month","DayofMonth","DayOfWeek","CRSDepTime",
| ... "CRSArrTime","UniqueCarrier","FlightNum"]
| >>> response_col = "IsDepDelayed"
| >>> airlines_dl = H2ODeepLearningEstimator(hidden=[200,200],
| ... activation="Rectifier",
| ... input_dropout_ratio=0.0,
| ... momentum_start=0.9,
| ... momentum_stable=0.99,
| ... momentum_ramp=1e7,
| ... epochs=100,
| ... stopping_rounds=4,
| ... train_samples_per_iteration=30000,
| ... mini_batch_size=32,
| ... score_duty_cycle=0.25,
| ... score_interval=1)
| >>> airlines_dl.train(x=predictors,
| ... y=response_col,
| ... training_frame=airlines)
| >>> airlines_dl.mse()
|
| momentum_stable
| Final momentum after the ramp is over (try 0.99).
|
| Type: ``float``, defaults to ``0.0``.
|
| :examples:
|
| >>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
| >>> predictors = ["Year","Month","DayofMonth","DayOfWeek","CRSDepTime",
| ... "CRSArrTime","UniqueCarrier","FlightNum"]
| >>> response_col = "IsDepDelayed"
| >>> airlines_dl = H2ODeepLearningEstimator(hidden=[200,200],
| ... activation="Rectifier",
| ... input_dropout_ratio=0.0,
| ... momentum_start=0.9,
| ... momentum_stable=0.99,
| ... momentum_ramp=1e7,
| ... epochs=100,
| ... stopping_rounds=4,
| ... train_samples_per_iteration=30000,
| ... mini_batch_size=32,
| ... score_duty_cycle=0.25,
| ... score_interval=1)
| >>> airlines_dl.train(x=predictors,
| ... y=response_col,
| ... training_frame=airlines)
| >>> airlines_dl.mse()
|
| momentum_start
| Initial momentum at the beginning of training (try 0.5).
|
| Type: ``float``, defaults to ``0.0``.
|
| :examples:
|
| >>> airlines = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
| >>> predictors = ["Year","Month","DayofMonth","DayOfWeek","CRSDepTime",
| ... "CRSArrTime","UniqueCarrier","FlightNum"]
| >>> response_col = "IsDepDelayed"
| >>> airlines_dl = H2ODeepLearningEstimator(hidden=[200,200],
| ... activation="Rectifier",
| ... input_dropout_ratio=0.0,
| ... momentum_start=0.9,
| ... momentum_stable=0.99,
| ... momentum_ramp=1e7,
| ... epochs=100,
| ... stopping_rounds=4,
| ... train_samples_per_iteration=30000,
| ... mini_batch_size=32,
| ... score_duty_cycle=0.25,
| ... score_interval=1)
| >>> airlines_dl.train(x=predictors,
| ... y=response_col,
| ... training_frame=airlines)
| >>> airlines_dl.mse()
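The three momentum parameters describe a ramp over training samples: momentum starts at ``momentum_start``, rises toward ``momentum_stable``, and stays there after ``momentum_ramp`` samples have been seen. A linear-ramp sketch (conceptual, not necessarily H2O's exact schedule):

```python
def momentum_at(samples_seen, start, stable, ramp):
    """Momentum ramped linearly from `start` to `stable` over `ramp` samples."""
    if samples_seen >= ramp:
        return stable
    return start + (stable - start) * samples_seen / ramp

print(momentum_at(5e5, start=0.9, stable=0.99, ramp=1e6))  # halfway up the ramp
```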
|
| nesterov_accelerated_gradient
| Use Nesterov accelerated gradient (recommended).
|
| Type: ``bool``, defaults to ``True``.
|
| :examples:
|
| >>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
| >>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
| >>> predictors = list(range(0,784))
| >>> resp = 784
| >>> train[resp] = train[resp].asfactor()
| >>> test[resp] = test[resp].asfactor()
| >>> nclasses = train[resp].nlevels()[0]
| >>> model = H2ODeepLearningEstimator(activation="RectifierWithDropout",
| ... adaptive_rate=False,
| ... rate=0.01,
| ... rate_decay=0.9,
| ... rate_annealing=1e-6,
| ... momentum_start=0.95,
| ... momentum_ramp=1e5,
| ... momentum_stable=0.99,
| ... nesterov_accelerated_gradient=False,
| ... input_dropout_ratio=0.2,
| ... train_samples_per_iteration=20000,
| ... classification_stop=-1,
| ... l1=1e-5)
| >>> model.train (x=predictors,
| ... y=resp,
| ... training_frame=train,
| ... validation_frame=test)
| >>> model.model_performance()
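Nesterov's trick is to evaluate the gradient at a look-ahead point rather than at the current weights. A one-dimensional sketch (conceptual, not H2O's implementation):

```python
def nesterov_step(w, v, grad_fn, lr, mu):
    """Nesterov accelerated gradient: gradient is taken at w + mu*v (look-ahead)."""
    lookahead = w + mu * v
    v = mu * v - lr * grad_fn(lookahead)
    return w + v, v

# minimize f(w) = w^2, whose gradient is 2w
w, v = 5.0, 0.0
for _ in range(50):
    w, v = nesterov_step(w, v, lambda x: 2 * x, lr=0.1, mu=0.9)
```

The look-ahead evaluation damps the overshoot that plain momentum exhibits, which is why this variant is recommended by default.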
|
| nfolds
| Number of folds for K-fold cross-validation (0 to disable or >= 2).
|
| Type: ``int``, defaults to ``0``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> cars_dl = H2ODeepLearningEstimator(nfolds=5, seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=cars)
| >>> cars_dl.auc()
|
| offset_column
| Offset column. This will be added to the combination of columns before applying the link function.
|
| Type: ``str``.
|
| :examples:
|
| >>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
| >>> predictors = boston.columns[:-1]
| >>> response = "medv"
| >>> boston['chas'] = boston['chas'].asfactor()
| >>> boston["offset"] = boston["medv"].log()
| >>> train, valid = boston.split_frame(ratios=[.8], seed=1234)
| >>> boston_dl = H2ODeepLearningEstimator(offset_column="offset",
| ... seed=1234)
| >>> boston_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> boston_dl.mse()
|
| overwrite_with_best_model
| If enabled, overwrite the final model with the best model found during training.
|
| Type: ``bool``, defaults to ``True``.
|
| :examples:
|
| >>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
| >>> predictors = boston.columns[:-1]
| >>> response = "medv"
| >>> boston['chas'] = boston['chas'].asfactor()
| >>> boston["offset"] = boston["medv"].log()
| >>> train, valid = boston.split_frame(ratios=[.8], seed=1234)
| >>> boston_dl = H2ODeepLearningEstimator(overwrite_with_best_model=True,
| ... seed=1234)
| >>> boston_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> boston_dl.mse()
|
| pretrained_autoencoder
| Pretrained autoencoder model to initialize this model with.
|
| Type: ``Union[None, str, H2OEstimator]``.
|
| :examples:
|
| >>> from h2o.estimators.deeplearning import H2OAutoEncoderEstimator
| >>> resp = 784
| >>> nfeatures = 20
| >>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
| >>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
| >>> train[resp] = train[resp].asfactor()
| >>> test[resp] = test[resp].asfactor()
| >>> sid = train[0].runif(0)
| >>> train_unsupervised = train[sid>=0.5]
| >>> train_unsupervised.pop(resp)
| >>> train_supervised = train[sid<0.5]
| >>> ae_model = H2OAutoEncoderEstimator(activation="Tanh",
| ... hidden=[nfeatures],
| ... model_id="ae_model",
| ... epochs=1,
| ... ignore_const_cols=False,
| ... reproducible=True,
| ... seed=1234)
| >>> ae_model.train(list(range(resp)), training_frame=train_unsupervised)
| >>> ae_model.mse()
| >>> pretrained_model = H2ODeepLearningEstimator(activation="Tanh",
| ... hidden=[nfeatures],
| ... epochs=1,
| ... reproducible=True,
| ... seed=1234,
| ... ignore_const_cols=False,
| ... pretrained_autoencoder="ae_model")
| >>> pretrained_model.train(list(range(resp)), resp,
| ... training_frame=train_supervised,
| ... validation_frame=test)
| >>> pretrained_model.mse()
|
| quantile_alpha
| Desired quantile for Quantile regression, must be between 0 and 1.
|
| Type: ``float``, defaults to ``0.5``.
|
| :examples:
|
| >>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
| >>> predictors = boston.columns[:-1]
| >>> response = "medv"
| >>> boston['chas'] = boston['chas'].asfactor()
| >>> train, valid = boston.split_frame(ratios=[.8], seed=1234)
| >>> boston_dl = H2ODeepLearningEstimator(distribution="quantile",
| ... quantile_alpha=.8,
| ... seed=1234)
| >>> boston_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> boston_dl.mse()
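|
| The quantile objective minimizes the asymmetric "pinball" loss, which weights
| under- and over-prediction by ``quantile_alpha`` and ``1 - quantile_alpha``
| respectively. A pure-Python sketch (illustrative only, not H2O's implementation):

```python
def pinball_loss(y, pred, alpha=0.5):
    # Under-prediction (y > pred) costs alpha per unit of error;
    # over-prediction costs (1 - alpha) per unit of error.
    e = y - pred
    return alpha * e if e >= 0 else (alpha - 1) * e

# With alpha=0.8 (as in the example above), under-predicting by 2 costs more
# than over-predicting by 2, pushing the fit toward the 80th percentile.
under = pinball_loss(10.0, 8.0, alpha=0.8)   # larger penalty
over = pinball_loss(8.0, 10.0, alpha=0.8)    # smaller penalty
```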
|
| quiet_mode
| Enable quiet mode for less output to standard output.
|
| Type: ``bool``, defaults to ``False``.
|
| :examples:
|
| >>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
| >>> titanic['survived'] = titanic['survived'].asfactor()
| >>> predictors = titanic.columns
| >>> del predictors[1:3]
| >>> response = 'survived'
| >>> train, valid = titanic.split_frame(ratios=[.8], seed=1234)
| >>> titanic_dl = H2ODeepLearningEstimator(quiet_mode=True,
| ... seed=1234)
| >>> titanic_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> titanic_dl.mse()
|
| rate
| Learning rate (higher => less stable, lower => slower convergence).
|
| Type: ``float``, defaults to ``0.005``.
|
| :examples:
|
| >>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
| >>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
| >>> predictors = list(range(0,784))
| >>> resp = 784
| >>> train[resp] = train[resp].asfactor()
| >>> test[resp] = test[resp].asfactor()
| >>> nclasses = train[resp].nlevels()[0]
| >>> model = H2ODeepLearningEstimator(activation="RectifierWithDropout",
| ... adaptive_rate=False,
| ... rate=0.01,
| ... rate_decay=0.9,
| ... rate_annealing=1e-6,
| ... momentum_start=0.95,
| ... momentum_ramp=1e5,
| ... momentum_stable=0.99,
| ... nesterov_accelerated_gradient=False,
| ... input_dropout_ratio=0.2,
| ... train_samples_per_iteration=20000,
| ... classification_stop=-1,
| ... l1=1e-5)
| >>> model.train(x=predictors, y=resp, training_frame=train, validation_frame=test)
| >>> model.model_performance(valid=True)
|
| rate_annealing
| Learning rate annealing: rate / (1 + rate_annealing * samples).
|
| Type: ``float``, defaults to ``1e-06``.
|
| :examples:
|
| >>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
| >>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
| >>> predictors = list(range(0,784))
| >>> resp = 784
| >>> train[resp] = train[resp].asfactor()
| >>> test[resp] = test[resp].asfactor()
| >>> nclasses = train[resp].nlevels()[0]
| >>> model = H2ODeepLearningEstimator(activation="RectifierWithDropout",
| ... adaptive_rate=False,
| ... rate=0.01,
| ... rate_decay=0.9,
| ... rate_annealing=1e-6,
| ... momentum_start=0.95,
| ... momentum_ramp=1e5,
| ... momentum_stable=0.99,
| ... nesterov_accelerated_gradient=False,
| ... input_dropout_ratio=0.2,
| ... train_samples_per_iteration=20000,
| ... classification_stop=-1,
| ... l1=1e-5)
| >>> model.train(x=predictors,
| ... y=resp,
| ... training_frame=train,
| ... validation_frame=test)
| >>> model.mse()
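|
| The annealing formula above can be sketched in plain Python (illustrative
| only; H2O applies this schedule on the backend during training):

```python
def annealed_rate(rate, rate_annealing, samples):
    # rate / (1 + rate_annealing * samples): the effective learning rate
    # decays smoothly as more training samples have been processed.
    return rate / (1 + rate_annealing * samples)

# With the example's rate=0.01 and rate_annealing=1e-6, the effective rate
# is roughly halved after one million training samples.
start = annealed_rate(0.01, 1e-6, 0)
later = annealed_rate(0.01, 1e-6, 1_000_000)
```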
|
| rate_decay
| Learning rate decay factor between layers (N-th layer: rate * rate_decay ^ (n - 1)).
|
| Type: ``float``, defaults to ``1.0``.
|
| :examples:
|
| >>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
| >>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
| >>> predictors = list(range(0,784))
| >>> resp = 784
| >>> train[resp] = train[resp].asfactor()
| >>> test[resp] = test[resp].asfactor()
| >>> nclasses = train[resp].nlevels()[0]
| >>> model = H2ODeepLearningEstimator(activation="RectifierWithDropout",
| ... adaptive_rate=False,
| ... rate=0.01,
| ... rate_decay=0.9,
| ... rate_annealing=1e-6,
| ... momentum_start=0.95,
| ... momentum_ramp=1e5,
| ... momentum_stable=0.99,
| ... nesterov_accelerated_gradient=False,
| ... input_dropout_ratio=0.2,
| ... train_samples_per_iteration=20000,
| ... classification_stop=-1,
| ... l1=1e-5)
| >>> model.train(x=predictors,
| ... y=resp,
| ... training_frame=train,
| ... validation_frame=test)
| >>> model.model_performance()
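|
| The per-layer schedule can be sketched directly (illustrative only):

```python
def layer_rate(rate, rate_decay, n):
    # Learning rate for the n-th hidden layer (1-based):
    # rate * rate_decay ** (n - 1).
    return rate * rate_decay ** (n - 1)

# With rate=0.01 and rate_decay=0.9 as in the example above, deeper layers
# learn progressively more slowly.
rates = [layer_rate(0.01, 0.9, n) for n in (1, 2, 3)]
```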
|
| regression_stop
| Stopping criterion for regression error (MSE) on training data (-1 to disable).
|
| Type: ``float``, defaults to ``1e-06``.
|
| :examples:
|
| >>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
| >>> airlines["Year"]= airlines["Year"].asfactor()
| >>> airlines["Month"]= airlines["Month"].asfactor()
| >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
| >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
| >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
| >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
| ... "DayOfWeek", "Month", "Distance", "FlightNum"]
| >>> response = "IsDepDelayed"
| >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
| >>> airlines_dl = H2ODeepLearningEstimator(regression_stop=1e-6,
| ... seed=1234)
| >>> airlines_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> airlines_dl.auc()
|
| replicate_training_data
| Replicate the entire training dataset onto every node for faster training on small datasets.
|
| Type: ``bool``, defaults to ``True``.
|
| :examples:
|
| >>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
| >>> airlines["Year"]= airlines["Year"].asfactor()
| >>> airlines["Month"]= airlines["Month"].asfactor()
| >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
| >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
| >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
| >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
| ... "DayOfWeek", "Month", "Distance", "FlightNum"]
| >>> response = "IsDepDelayed"
| >>> airlines_dl = H2ODeepLearningEstimator(replicate_training_data=False)
| >>> airlines_dl.train(x=predictors,
| ... y=response,
| ... training_frame=airlines)
| >>> airlines_dl.auc()
|
| reproducible
| Force reproducibility on small data (will be slow - only uses 1 thread).
|
| Type: ``bool``, defaults to ``False``.
|
| :examples:
|
| >>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
| >>> airlines["Year"]= airlines["Year"].asfactor()
| >>> airlines["Month"]= airlines["Month"].asfactor()
| >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
| >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
| >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
| >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
| ... "DayOfWeek", "Month", "Distance", "FlightNum"]
| >>> response = "IsDepDelayed"
| >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
| >>> airlines_dl = H2ODeepLearningEstimator(reproducible=True)
| >>> airlines_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> airlines_dl.auc()
|
| response_column
| Response variable column.
|
| Type: ``str``.
|
| rho
| Adaptive learning rate time decay factor (similarity to prior updates).
|
| Type: ``float``, defaults to ``0.99``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> cars_dl = H2ODeepLearningEstimator(rho=0.9,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=cars)
| >>> cars_dl.auc()
|
| score_duty_cycle
| Maximum duty cycle fraction for scoring (lower: more training, higher: more scoring).
|
| Type: ``float``, defaults to ``0.1``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> cars_dl = H2ODeepLearningEstimator(score_duty_cycle=0.2,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=cars)
| >>> cars_dl.auc()
|
| score_each_iteration
| Whether to score during each iteration of model training.
|
| Type: ``bool``, defaults to ``False``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> cars_dl = H2ODeepLearningEstimator(score_each_iteration=True,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=cars)
| >>> cars_dl.auc()
|
| score_interval
| Shortest time interval (in seconds) between model scoring.
|
| Type: ``float``, defaults to ``5.0``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> cars_dl = H2ODeepLearningEstimator(score_interval=3,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=cars)
| >>> cars_dl.auc()
|
| score_training_samples
| Number of training set samples for scoring (0 for all).
|
| Type: ``int``, defaults to ``10000``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> cars_dl = H2ODeepLearningEstimator(score_training_samples=10000,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=cars)
| >>> cars_dl.auc()
|
| score_validation_samples
| Number of validation set samples for scoring (0 for all).
|
| Type: ``int``, defaults to ``0``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(score_validation_samples=3,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.auc()
|
| score_validation_sampling
| Method used to sample validation dataset for scoring.
|
| Type: ``Literal["uniform", "stratified"]``, defaults to ``"uniform"``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(score_validation_sampling="uniform",
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.auc()
|
| seed
| Seed for random numbers (affects sampling) - Note: only reproducible when running single threaded.
|
| Type: ``int``, defaults to ``-1``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.auc()
|
| shuffle_training_data
| Enable shuffling of training data (recommended if training data is replicated and train_samples_per_iteration is
| close to #nodes x #rows, or if using balance_classes).
|
| Type: ``bool``, defaults to ``False``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(shuffle_training_data=True,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=cars)
| >>> cars_dl.auc()
|
| single_node_mode
| Run on a single node for fine-tuning of model parameters.
|
| Type: ``bool``, defaults to ``False``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(single_node_mode=True,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=cars)
| >>> cars_dl.auc()
|
| sparse
| Sparse data handling (more efficient for data with lots of 0 values).
|
| Type: ``bool``, defaults to ``False``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(sparse=True,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=cars)
| >>> cars_dl.auc()
|
| sparsity_beta
| Sparsity regularization. #Experimental
|
| Type: ``float``, defaults to ``0.0``.
|
| :examples:
|
| >>> from h2o.estimators import H2OAutoEncoderEstimator
| >>> resp = 784
| >>> nfeatures = 20
| >>> train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/train.csv.gz")
| >>> test = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/bigdata/laptop/mnist/test.csv.gz")
| >>> train[resp] = train[resp].asfactor()
| >>> test[resp] = test[resp].asfactor()
| >>> sid = train[0].runif(0)
| >>> train_unsupervised = train[sid>=0.5]
| >>> train_unsupervised.pop(resp)
| >>> ae_model = H2OAutoEncoderEstimator(activation="Tanh",
| ... hidden=[nfeatures],
| ... epochs=1,
| ... ignore_const_cols=False,
| ... reproducible=True,
| ... sparsity_beta=0.5,
| ... seed=1234)
| >>> ae_model.train(list(range(resp)),
| ... training_frame=train_unsupervised)
| >>> ae_model.mse()
|
| standardize
| If enabled, automatically standardize the data. If disabled, the user must provide properly scaled input data.
|
| Type: ``bool``, defaults to ``True``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> cars_dl = H2ODeepLearningEstimator(standardize=True,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=cars)
| >>> cars_dl.auc()
|
| stopping_metric
| Metric to use for early stopping (AUTO: logloss for classification, deviance for regression and anomaly_score
| for Isolation Forest). Note that custom and custom_increasing can only be used in GBM and DRF with the Python
| client.
|
| Type: ``Literal["auto", "deviance", "logloss", "mse", "rmse", "mae", "rmsle", "auc", "aucpr", "lift_top_group",
| "misclassification", "mean_per_class_error", "custom", "custom_increasing"]``, defaults to ``"auto"``.
|
| :examples:
|
| >>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
| >>> airlines["Year"]= airlines["Year"].asfactor()
| >>> airlines["Month"]= airlines["Month"].asfactor()
| >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
| >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
| >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
| >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
| ... "DayOfWeek", "Month", "Distance", "FlightNum"]
| >>> response = "IsDepDelayed"
| >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
| >>> airlines_dl = H2ODeepLearningEstimator(stopping_metric="auc",
| ... stopping_rounds=3,
| ... stopping_tolerance=1e-2,
| ... seed=1234)
| >>> airlines_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> airlines_dl.auc()
|
| stopping_rounds
| Early stopping based on convergence of stopping_metric. Stop if simple moving average of length k of the
| stopping_metric does not improve for k:=stopping_rounds scoring events (0 to disable)
|
| Type: ``int``, defaults to ``5``.
|
| :examples:
|
| >>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
| >>> airlines["Year"]= airlines["Year"].asfactor()
| >>> airlines["Month"]= airlines["Month"].asfactor()
| >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
| >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
| >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
| >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
| ... "DayOfWeek", "Month", "Distance", "FlightNum"]
| >>> response = "IsDepDelayed"
| >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
| >>> airlines_dl = H2ODeepLearningEstimator(stopping_metric="auc",
| ... stopping_rounds=3,
| ... stopping_tolerance=1e-2,
| ... seed=1234)
| >>> airlines_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> airlines_dl.auc()
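|
| The rule compares moving averages of recent scoring events. A simplified
| pure-Python sketch (not H2O's exact implementation, which also applies
| ``stopping_tolerance``):

```python
def should_stop(scores, k, larger_is_better=True):
    # Compare the simple moving average of the last k scoring events
    # against the average of the k events before them.
    if len(scores) < 2 * k:
        return False
    recent = sum(scores[-k:]) / k
    prior = sum(scores[-2 * k:-k]) / k
    return recent <= prior if larger_is_better else recent >= prior

aucs = [0.70, 0.74, 0.76, 0.76, 0.76, 0.76]
should_stop(aucs, k=3)   # False: the recent window still improved on average
should_stop([0.76, 0.76, 0.76, 0.75, 0.75, 0.75], k=3)   # True: no improvement
```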
|
| stopping_tolerance
| Relative tolerance for metric-based stopping criterion (stop if relative improvement is not at least this much)
|
| Type: ``float``, defaults to ``0.0``.
|
| :examples:
|
| >>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
| >>> airlines["Year"]= airlines["Year"].asfactor()
| >>> airlines["Month"]= airlines["Month"].asfactor()
| >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
| >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
| >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
| >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
| ... "DayOfWeek", "Month", "Distance", "FlightNum"]
| >>> response = "IsDepDelayed"
| >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
| >>> airlines_dl = H2ODeepLearningEstimator(stopping_metric="auc",
| ... stopping_rounds=3,
| ... stopping_tolerance=1e-2,
| ... seed=1234)
| >>> airlines_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> airlines_dl.auc()
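|
| A sketch of the relative-improvement check (illustrative only; H2O combines
| this with the moving-average rule from ``stopping_rounds``):

```python
def improved(best, current, tolerance, larger_is_better=True):
    # An improvement only counts if it beats `best` by more than the
    # relative tolerance.
    if larger_is_better:
        return current > best * (1 + tolerance)
    return current < best * (1 - tolerance)

improved(0.80, 0.82, tolerance=1e-2)   # True: +2.5% relative gain
improved(0.80, 0.805, tolerance=1e-2)  # False: only ~0.6% relative gain
```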
|
| target_ratio_comm_to_comp
| Target ratio of communication overhead to computation. Only for multi-node operation and
| train_samples_per_iteration = -2 (auto-tuning).
|
| Type: ``float``, defaults to ``0.05``.
|
| :examples:
|
| >>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
| >>> airlines["Year"]= airlines["Year"].asfactor()
| >>> airlines["Month"]= airlines["Month"].asfactor()
| >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
| >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
| >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
| >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
| ... "DayOfWeek", "Month", "Distance", "FlightNum"]
| >>> response = "IsDepDelayed"
| >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
| >>> airlines_dl = H2ODeepLearningEstimator(target_ratio_comm_to_comp=0.05,
| ... seed=1234)
| >>> airlines_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> airlines_dl.auc()
|
| train_samples_per_iteration
| Number of training samples (globally) per MapReduce iteration. Special values are 0: one epoch, -1: all
| available data (e.g., replicated training data), -2: automatic.
|
| Type: ``int``, defaults to ``-2``.
|
| :examples:
|
| >>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
| >>> airlines["Year"]= airlines["Year"].asfactor()
| >>> airlines["Month"]= airlines["Month"].asfactor()
| >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
| >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
| >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
| >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
| ... "DayOfWeek", "Month", "Distance", "FlightNum"]
| >>> response = "IsDepDelayed"
| >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
| >>> airlines_dl = H2ODeepLearningEstimator(train_samples_per_iteration=-1,
| ... epochs=1,
| ... seed=1234)
| >>> airlines_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> airlines_dl.auc()
|
| training_frame
| Id of the training data frame.
|
| Type: ``Union[None, str, H2OFrame]``.
|
| :examples:
|
| >>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
| >>> airlines["Year"]= airlines["Year"].asfactor()
| >>> airlines["Month"]= airlines["Month"].asfactor()
| >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
| >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
| >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
| >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
| ... "DayOfWeek", "Month", "Distance", "FlightNum"]
| >>> response = "IsDepDelayed"
| >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
| >>> airlines_dl = H2ODeepLearningEstimator()
| >>> airlines_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> airlines_dl.auc()
|
| tweedie_power
| Tweedie power for Tweedie regression, must be between 1 and 2.
|
| Type: ``float``, defaults to ``1.5``.
|
| :examples:
|
| >>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
| >>> airlines["Year"]= airlines["Year"].asfactor()
| >>> airlines["Month"]= airlines["Month"].asfactor()
| >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
| >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
| >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
| >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
| ... "DayOfWeek", "Month", "Distance", "FlightNum"]
| >>> response = "IsDepDelayed"
| >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
| >>> airlines_dl = H2ODeepLearningEstimator(tweedie_power=1.5,
| ... seed=1234)
| >>> airlines_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> airlines_dl.auc()
|
| use_all_factor_levels
| Use all factor levels of categorical variables. Otherwise, the first factor level is omitted (without loss of
| accuracy). Useful for variable importances and auto-enabled for autoencoder.
|
| Type: ``bool``, defaults to ``True``.
|
| :examples:
|
| >>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
| >>> airlines["Year"]= airlines["Year"].asfactor()
| >>> airlines["Month"]= airlines["Month"].asfactor()
| >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
| >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
| >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
| >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
| ... "DayOfWeek", "Month", "Distance", "FlightNum"]
| >>> response = "IsDepDelayed"
| >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
| >>> airlines_dl = H2ODeepLearningEstimator(use_all_factor_levels=True,
| ... seed=1234)
| >>> airlines_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> airlines_dl.mse()
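|
| The encoding difference can be sketched as follows (illustrative only; H2O
| performs this categorical expansion internally):

```python
def encode(value, levels, use_all_factor_levels=True):
    # With use_all_factor_levels=False the first level is dropped and
    # becomes the implicit all-zeros baseline.
    used = levels if use_all_factor_levels else levels[1:]
    return [1 if value == lv else 0 for lv in used]

encode("Feb", ["Jan", "Feb", "Mar"])         # [0, 1, 0]
encode("Jan", ["Jan", "Feb", "Mar"], False)  # [0, 0]  (baseline level)
```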
|
| validation_frame
| Id of the validation data frame.
|
| Type: ``Union[None, str, H2OFrame]``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","weight","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(standardize=True,
| ... seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.auc()
|
| variable_importances
| Compute variable importances for input features (Gedeon method) - can be slow for large networks.
|
| Type: ``bool``, defaults to ``True``.
|
| :examples:
|
| >>> airlines= h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/airlines/allyears2k_headers.zip")
| >>> airlines["Year"]= airlines["Year"].asfactor()
| >>> airlines["Month"]= airlines["Month"].asfactor()
| >>> airlines["DayOfWeek"] = airlines["DayOfWeek"].asfactor()
| >>> airlines["Cancelled"] = airlines["Cancelled"].asfactor()
| >>> airlines['FlightNum'] = airlines['FlightNum'].asfactor()
| >>> predictors = ["Origin", "Dest", "Year", "UniqueCarrier",
| ... "DayOfWeek", "Month", "Distance", "FlightNum"]
| >>> response = "IsDepDelayed"
| >>> train, valid= airlines.split_frame(ratios=[.8], seed=1234)
| >>> airlines_dl = H2ODeepLearningEstimator(variable_importances=True,
| ... seed=1234)
| >>> airlines_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> airlines_dl.mse()
|
| weights_column
| Column with observation weights. Giving some observation a weight of zero is equivalent to excluding it from the
| dataset; giving an observation a relative weight of 2 is equivalent to repeating that row twice. Negative
| weights are not allowed. Note: Weights are per-row observation weights and do not increase the size of the data
| frame. This is typically the number of times a row is repeated, but non-integer values are supported as well.
| During training, rows with higher weights matter more, due to the larger loss function pre-factor. If you set
| weight = 0 for a row, the returned prediction frame at that row is zero, which is incorrect; to get an
| accurate prediction, remove all rows with weight == 0.
|
| Type: ``str``.
|
| :examples:
|
| >>> cars = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
| >>> cars["economy_20mpg"] = cars["economy_20mpg"].asfactor()
| >>> predictors = ["displacement","power","acceleration","year"]
| >>> response = "economy_20mpg"
| >>> train, valid = cars.split_frame(ratios=[.8], seed=1234)
| >>> cars_dl = H2ODeepLearningEstimator(seed=1234)
| >>> cars_dl.train(x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> cars_dl.auc()
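|
| The weight semantics above (weight 2 acts like a repeated row, weight 0
| excludes the row) can be sketched with a weighted mean (illustrative only):

```python
def weighted_mean(values, weights):
    # Each value contributes in proportion to its weight; weight 0 rows
    # drop out entirely, weight 2 rows count twice.
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total

weighted_mean([1.0, 4.0], [2, 1])   # same as the mean of [1.0, 1.0, 4.0]
weighted_mean([1.0, 4.0], [0, 1])   # 4.0 -- the first row is excluded
```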
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| algo = 'deeplearning'
|
| supervised_learning = True
|
| ----------------------------------------------------------------------
| Methods inherited from h2o.estimators.estimator_base.H2OEstimator:
|
| fit(self, X, y=None, **params)
| Fit an H2O model as part of a scikit-learn pipeline or grid search.
|
| A warning will be issued if a caller other than sklearn attempts to use this method.
|
| :param H2OFrame X: An H2OFrame consisting of the predictor variables.
| :param H2OFrame y: An H2OFrame consisting of the response variable.
| :param params: Extra arguments.
| :returns: The current instance of H2OEstimator for method chaining.
|
| get_params(self, deep=True)
| Obtain parameters for this estimator.
|
| Used primarily for sklearn Pipelines and sklearn grid search.
|
| :param deep: If True, return parameters of all sub-objects that are estimators.
|
| :returns: A dict of parameters
|
| join(self)
| Wait until job's completion.
|
| set_params(self, **parms)
| Used by sklearn for updating parameters during grid search.
|
| :param parms: A dictionary of parameters that will be set on this model.
| :returns: self, the current estimator object with the parameters all set as desired.
|
| start(self, x, y=None, training_frame=None, offset_column=None, fold_column=None, weights_column=None, validation_frame=None, **params)
| Train the model asynchronously (to block for results call :meth:`join`).
|
| :param x: A list of column names or indices indicating the predictor columns.
| :param y: An index or a column name indicating the response column.
| :param H2OFrame training_frame: The H2OFrame having the columns indicated by x and y (as well as any
| additional columns specified by fold, offset, and weights).
| :param offset_column: The name or index of the column in training_frame that holds the offsets.
| :param fold_column: The name or index of the column in training_frame that holds the per-row fold
| assignments.
| :param weights_column: The name or index of the column in training_frame that holds the per-row weights.
| :param validation_frame: H2OFrame with validation data to be scored on while training.
|
| train(self, x=None, y=None, training_frame=None, offset_column=None, fold_column=None, weights_column=None, validation_frame=None, max_runtime_secs=None, ignored_columns=None, model_id=None, verbose=False)
| Train the H2O model.
|
| :param x: A list of column names or indices indicating the predictor columns.
| :param y: An index or a column name indicating the response column.
| :param H2OFrame training_frame: The H2OFrame having the columns indicated by x and y (as well as any
| additional columns specified by fold, offset, and weights).
| :param offset_column: The name or index of the column in training_frame that holds the offsets.
| :param fold_column: The name or index of the column in training_frame that holds the per-row fold
| assignments.
| :param weights_column: The name or index of the column in training_frame that holds the per-row weights.
| :param validation_frame: H2OFrame with validation data to be scored on while training.
| :param float max_runtime_secs: Maximum allowed runtime in seconds for model training. Use 0 to disable.
| :param bool verbose: Print scoring history to stdout. Defaults to False.
|
| train_segments(self, x=None, y=None, training_frame=None, offset_column=None, fold_column=None, weights_column=None, validation_frame=None, max_runtime_secs=None, ignored_columns=None, segments=None, segment_models_id=None, parallelism=1, verbose=False)
| Trains H2O model for each segment (subpopulation) of the training dataset.
|
| :param x: A list of column names or indices indicating the predictor columns.
| :param y: An index or a column name indicating the response column.
| :param H2OFrame training_frame: The H2OFrame having the columns indicated by x and y (as well as any
| additional columns specified by fold, offset, and weights).
| :param offset_column: The name or index of the column in training_frame that holds the offsets.
| :param fold_column: The name or index of the column in training_frame that holds the per-row fold
| assignments.
| :param weights_column: The name or index of the column in training_frame that holds the per-row weights.
| :param validation_frame: H2OFrame with validation data to be scored on while training.
| :param float max_runtime_secs: Maximum allowed runtime in seconds for each model training. Use 0 to disable.
| Please note that regardless of how this parameter is set, a model will be built for each input segment.
| This parameter only affects individual model training.
| :param segments: A list of columns to segment-by. H2O will group the training (and validation) dataset
| by the segment-by columns and train a separate model for each segment (group of rows).
| As an alternative to providing a list of columns, users can also supply an explicit enumeration of
| segments to build the models for. This enumeration needs to be represented as H2OFrame.
| :param segment_models_id: Identifier for the returned collection of Segment Models. If not specified
| it will be automatically generated.
| :param parallelism: Level of parallelism of the bulk segment model building; it is the maximum number
| of models each H2O node will build in parallel.
| :param bool verbose: Enable to print additional information during model building. Defaults to False.
|
| :examples:
|
| >>> response = "survived"
| >>> titanic = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/titanic.csv")
| >>> titanic[response] = titanic[response].asfactor()
| >>> predictors = ["name","sex","age","sibsp","parch","ticket","fare","cabin"]
| >>> train, valid = titanic.split_frame(ratios=[.8], seed=1234)
| >>> from h2o.estimators.gbm import H2OGradientBoostingEstimator
| >>> titanic_gbm = H2OGradientBoostingEstimator(seed=1234)
| >>> titanic_models = titanic_gbm.train_segments(segments=["pclass"],
| ... x=predictors,
| ... y=response,
| ... training_frame=train,
| ... validation_frame=valid)
| >>> titanic_models.as_frame()
|
| ----------------------------------------------------------------------
| Methods inherited from h2o.model.model_base.ModelBase:
|
| __getattr__(self, name)
|
| __repr__(self)
| Return repr(self).
|
| aic(self, train=False, valid=False, xval=False)
| Get the AIC (Akaike Information Criterion).
|
| If all are False (default), then return the training metric value.
| If more than one option is set to True, then return a dictionary of metrics where the keys are "train",
| "valid", and "xval".
|
| :param bool train: If train is True, then return the AIC value for the training data.
| :param bool valid: If valid is True, then return the AIC value for the validation data.
| :param bool xval: If xval is True, then return the AIC value for the cross-validation data.
|
| :returns: The AIC.
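|
| As a hedged illustration of the quantity this method reports (a sketch of the standard AIC formula, not the H2O code; the log-likelihood values below are made up): AIC = 2k - 2*ln(L), where k is the number of fitted parameters and L is the maximized likelihood.

```python
# Sketch of the AIC formula: AIC = 2k - 2*ln(L). Lower is better.
def aic(k, log_likelihood):
    return 2 * k - 2 * log_likelihood

# A model with more parameters must improve the likelihood enough to
# justify them; here the extra parameters do not pay for themselves.
simple = aic(k=3, log_likelihood=-120.0)     # 246.0
complex_ = aic(k=10, log_likelihood=-118.0)  # 256.0
assert simple < complex_
```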
|
| auc(self, train=False, valid=False, xval=False)
| Get the AUC (Area Under Curve).
|
| If all are False (default), then return the training metric value.
| If more than one option is set to True, then return a dictionary of metrics where the keys are "train",
| "valid", and "xval".
|
| :param bool train: If train is True, then return the AUC value for the training data.
| :param bool valid: If valid is True, then return the AUC value for the validation data.
| :param bool xval: If xval is True, then return the AUC value for the cross-validation data.
|
| :returns: The AUC.
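|
| Conceptually (a plain-Python sketch on made-up data, not the H2O implementation), AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half:

```python
# Pairwise sketch of AUC (fine for tiny data; real implementations use
# more efficient rank- or histogram-based computations).
def pairwise_auc(labels, scores):
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.2]
# 0.9 outranks both negatives; 0.4 outranks only 0.2 -> 3 of 4 pairs.
assert pairwise_auc(labels, scores) == 0.75
```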
|
| aucpr(self, train=False, valid=False, xval=False)
| Get the aucPR (Area Under the Precision-Recall Curve).
|
| If all are False (default), then return the training metric value.
| If more than one option is set to True, then return a dictionary of metrics where the keys are "train",
| "valid", and "xval".
|
| :param bool train: If train is True, then return the aucpr value for the training data.
| :param bool valid: If valid is True, then return the aucpr value for the validation data.
| :param bool xval: If xval is True, then return the aucpr value for the cross-validation data.
|
| :returns: The aucpr.
|
| biases(self, vector_id=0)
| Return the frame for the respective bias vector.
|
| :param vector_id: an integer, ranging from 0 to the number of layers, that specifies the bias vector to return.
|
| :returns: an H2OFrame which represents the bias vector identified by vector_id
|
| catoffsets(self)
| Categorical offsets for one-hot encoding.
|
| coef(self)
| Return the coefficients which can be applied to the non-standardized data.
|
| Note: standardize = True by default; if set to False, then coef() returns the coefficients that are fit directly.
|
| coef_norm(self)
| Return coefficients fitted on the standardized data (requires standardize = True, which is on by default).
|
| These coefficients can be used to evaluate variable importance.
|
| cross_validation_fold_assignment(self)
| Obtain the cross-validation fold assignment for all rows in the training data.
|
| :returns: H2OFrame
|
| cross_validation_holdout_predictions(self)
| Obtain the (out-of-sample) holdout predictions of all cross-validation models on the training data.
|
| This is equivalent to summing up all H2OFrames returned by cross_validation_predictions.
|
| :returns: H2OFrame
|
| cross_validation_metrics_summary(self)
| Retrieve Cross-Validation Metrics Summary.
|
| :returns: The cross-validation metrics summary as an H2OTwoDimTable
|
| cross_validation_models(self)
| Obtain a list of cross-validation models.
|
| :returns: list of H2OModel objects.
|
| cross_validation_predictions(self)
| Obtain the (out-of-sample) holdout predictions of all cross-validation models on their holdout data.
|
| Note that the predictions are expanded to the full number of rows of the training data, with 0 fill-in.
|
| :returns: list of H2OFrame objects.
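|
| The "expanded to the full number of rows ... with 0 fill-in" behavior can be sketched in plain Python (hypothetical fold assignments and prediction values): each fold's holdout predictions sit in a full-length vector that is zero outside that fold's rows, so summing the per-fold vectors reconstructs a holdout prediction for every training row.

```python
# Sketch with made-up values: 6 training rows, 3 folds.
n_folds = 3
fold_of_row = [0, 1, 2, 0, 1, 2]              # per-row fold assignment
pred_of_row = [0.2, 0.5, 0.9, 0.1, 0.4, 0.8]  # holdout predictions

# One full-length vector per fold, zero-filled outside the fold's rows.
per_fold = [
    [p if f == fold else 0.0 for p, f in zip(pred_of_row, fold_of_row)]
    for fold in range(n_folds)
]

# Summing the per-fold vectors recovers a prediction for every row.
summed = [sum(col) for col in zip(*per_fold)]
assert summed == pred_of_row
```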
|
| deepfeatures(self, test_data, layer)
| Return hidden layer details.
|
| :param test_data: Data to create a feature space on
| :param layer: 0 index hidden layer
|
| detach(self)
| Detach the Python object from the backend, usually by clearing its key.
|
| download_model(self, path='', filename=None)
| Download an H2O Model object to disk.
|
| :param path: a path to the directory where the model should be saved.
| :param filename: a filename for the saved model
|
| :returns: the path of the downloaded model
|
| download_mojo(self, path='.', get_genmodel_jar=False, genmodel_name='')
| Download the model in MOJO format.
|
| :param path: the path where MOJO file should be saved.
| :param get_genmodel_jar: if True, then also download h2o-genmodel.jar and store it in folder ``path``.
| :param genmodel_name: Custom name of genmodel jar
| :returns: name of the MOJO file written.
|
| download_pojo(self, path='', get_genmodel_jar=False, genmodel_name='')
| Download the POJO for this model to the directory specified by path.
|
| If path is an empty string, then dump the output to screen.
|
| :param path: An absolute path to the directory where POJO should be saved.
| :param get_genmodel_jar: if True, then also download h2o-genmodel.jar and store it in folder ``path``.
| :param genmodel_name: Custom name of genmodel jar
| :returns: name of the POJO file written.
|
| explain(models, frame, columns=None, top_n_features=5, include_explanations='ALL', exclude_explanations=[], plot_overrides={}, figsize=(16, 9), render=True, qualitative_colormap='Dark2', sequential_colormap='RdYlBu_r')
| Generate model explanations on frame data set.
|
| The H2O Explainability Interface is a convenient wrapper to a number of explainabilty
| methods and visualizations in H2O. The function can be applied to a single model or group
| of models and returns an object containing explanations, such as a partial dependence plot
| or a variable importance plot. Most of the explanations are visual (plots).
| These plots can also be created by individual utility functions/methods.
|
| :param models: a list of H2O models, an H2O AutoML instance, or an H2OFrame with a 'model_id' column (e.g. H2OAutoML leaderboard)
| :param frame: H2OFrame
| :param columns: either a list of columns or column indices to show. If specified
| parameter top_n_features will be ignored.
| :param top_n_features: a number of columns to pick using variable importance (where applicable).
| :param include_explanations: if specified, return only the specified model explanations
| (Mutually exclusive with exclude_explanations)
| :param exclude_explanations: exclude specified model explanations
| :param plot_overrides: overrides for individual model explanations
| :param figsize: figure size; passed directly to matplotlib
| :param render: if True, render the model explanations; otherwise model explanations are just returned
| :returns: H2OExplanation containing the model explanations including headers and descriptions
|
| :examples:
| >>> import h2o
| >>> from h2o.automl import H2OAutoML
| >>>
| >>> h2o.init()
| >>>
| >>> # Import the wine dataset into H2O:
| >>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
| >>> df = h2o.import_file(f)
| >>>
| >>> # Set the response
| >>> response = "quality"
| >>>
| >>> # Split the dataset into a train and test set:
| >>> train, test = df.split_frame([0.8])
| >>>
| >>> # Train an H2OAutoML
| >>> aml = H2OAutoML(max_models=10)
| >>> aml.train(y=response, training_frame=train)
| >>>
| >>> # Create the H2OAutoML explanation
| >>> aml.explain(test)
| >>>
| >>> # Create the leader model explanation
| >>> aml.leader.explain(test)
|
| explain_row(models, frame, row_index, columns=None, top_n_features=5, include_explanations='ALL', exclude_explanations=[], plot_overrides={}, qualitative_colormap='Dark2', figsize=(16, 9), render=True)
| Generate model explanations on frame data set for a given instance.
|
| Explain the behavior of a model or group of models with respect to a single row of data.
| The function returns an object containing explanations, such as a partial dependence plot
| or a variable importance plot. Most of the explanations are visual (plots).
| These plots can also be created by individual utility functions/methods.
|
| :param models: H2OAutoML object, supervised H2O model, or list of supervised H2O models
| :param frame: H2OFrame
| :param row_index: row index of the instance to inspect
| :param columns: either a list of columns or column indices to show. If specified
| parameter top_n_features will be ignored.
| :param top_n_features: a number of columns to pick using variable importance (where applicable).
| :param include_explanations: if specified, return only the specified model explanations
| (Mutually exclusive with exclude_explanations)
| :param exclude_explanations: exclude specified model explanations
| :param plot_overrides: overrides for individual model explanations
| :param qualitative_colormap: a colormap name
| :param figsize: figure size; passed directly to matplotlib
| :param render: if True, render the model explanations; otherwise model explanations are just returned
|
| :returns: H2OExplanation containing the model explanations including headers and descriptions
|
| :examples:
| >>> import h2o
| >>> from h2o.automl import H2OAutoML
| >>>
| >>> h2o.init()
| >>>
| >>> # Import the wine dataset into H2O:
| >>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
| >>> df = h2o.import_file(f)
| >>>
| >>> # Set the response
| >>> response = "quality"
| >>>
| >>> # Split the dataset into a train and test set:
| >>> train, test = df.split_frame([0.8])
| >>>
| >>> # Train an H2OAutoML
| >>> aml = H2OAutoML(max_models=10)
| >>> aml.train(y=response, training_frame=train)
| >>>
| >>> # Create the H2OAutoML explanation
| >>> aml.explain_row(test, row_index=0)
| >>>
| >>> # Create the leader model explanation
| >>> aml.leader.explain_row(test, row_index=0)
|
| feature_frequencies(self, test_data)
| Retrieve the number of occurrences of each feature for given observations
| on their respective paths in a tree ensemble model.
| Available for GBM, Random Forest and Isolation Forest models.
|
| :param H2OFrame test_data: Data on which to calculate feature frequencies.
|
| :returns: A new H2OFrame made of feature contributions.
|
| feature_interaction(self, max_interaction_depth=100, max_tree_depth=100, max_deepening=-1, path=None)
| Feature interactions and importance, leaf statistics and split value histograms in a tabular form.
| Available for XGBoost and GBM.
|
| Metrics:
| Gain - Total gain of each feature or feature interaction.
| FScore - Number of possible splits taken on a feature or feature interaction.
| wFScore - Number of possible splits taken on a feature or feature interaction, weighted by
| the probability of the splits taking place.
| Average wFScore - wFScore divided by FScore.
| Average Gain - Gain divided by FScore.
| Expected Gain - Total gain of each feature or feature interaction weighed by the probability to gather the gain.
| Average Tree Index
| Average Tree Depth
|
| :param max_interaction_depth: Upper bound for extracted feature interactions depth. Defaults to 100.
| :param max_tree_depth: Upper bound for tree depth. Defaults to 100.
| :param max_deepening: Upper bound for interaction start deepening (zero deepening => interactions
| starting at root only). Defaults to -1.
| :param path: (Optional) Path where to save the output in .xlsx format (e.g. ``/mypath/file.xlsx``).
| Please note that Pandas and XlsxWriter need to be installed for using this option. Defaults to None.
|
|
| :examples:
| >>> boston = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/gbm_test/BostonHousing.csv")
| >>> predictors = boston.columns[:-1]
| >>> response = "medv"
| >>> boston['chas'] = boston['chas'].asfactor()
| >>> train, valid = boston.split_frame(ratios=[.8])
| >>> boston_xgb = H2OXGBoostEstimator(seed=1234)
| >>> boston_xgb.train(y=response, x=predictors, training_frame=train)
| >>> feature_interactions = boston_xgb.feature_interaction()
|
| get_xval_models(self, key=None)
| Return a Model object.
|
| :param key: If None, return all cross-validated models; otherwise return the model that key points to.
|
| :returns: A model or list of models.
|
| gini(self, train=False, valid=False, xval=False)
| Get the Gini coefficient.
|
| If all are False (default), then return the training metric value.
| If more than one option is set to True, then return a dictionary of metrics where the keys are "train",
| "valid", and "xval".
|
| :param bool train: If train is True, then return the Gini Coefficient value for the training data.
| :param bool valid: If valid is True, then return the Gini Coefficient value for the validation data.
| :param bool xval: If xval is True, then return the Gini Coefficient value for the cross validation data.
|
| :returns: The Gini Coefficient for this binomial model.
|
| h(self, frame, variables)
| Calculate Friedman and Popescu's H statistic to test for the presence of an interaction between the specified variables in H2O GBM and XGBoost models.
| H varies from 0 to 1: it is 0 if the model exhibits no interaction between the specified variables, and correspondingly larger for a
| stronger interaction effect between them. NaN is returned if the computation is spoiled by weak main effects and rounding errors.
|
| See Jerome H. Friedman and Bogdan E. Popescu, 2008, "Predictive learning via rule ensembles", *Ann. Appl. Stat.*
| **2**:916-954, http://projecteuclid.org/download/pdfview_1/euclid.aoas/1223908046, s. 8.1.
|
|
| :param frame: the frame that current model has been fitted to
| :param variables: the variables of interest
| :return: H statistic of the variables
|
| :examples:
| >>> prostate_train = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/logreg/prostate_train.csv")
| >>> prostate_train["CAPSULE"] = prostate_train["CAPSULE"].asfactor()
| >>> gbm_h2o = H2OGradientBoostingEstimator(ntrees=100,
| ...                                        learn_rate=0.1,
| ...                                        max_depth=5,
| ...                                        min_rows=10,
| ...                                        distribution="bernoulli")
| >>> gbm_h2o.train(x=list(range(1,prostate_train.ncol)),y="CAPSULE", training_frame=prostate_train)
| >>> h = gbm_h2o.h(prostate_train, ['DPROS','DCAPS'])
|
| ice_plot(model, frame, column, target=None, max_levels=30, figsize=(16, 9), colormap='plasma')
| Plot Individual Conditional Expectations (ICE) for each decile
|
| An individual conditional expectations (ICE) plot gives a graphical depiction of the marginal
| effect of a variable on the response. An ICE plot is similar to a partial dependence plot (PDP):
| a PDP shows the average effect of a feature, while an ICE plot shows the effect for a single
| instance. This plot shows the effect for each decile. In contrast to a partial dependence plot,
| an ICE plot can provide more insight, especially when there is stronger feature interaction.
|
| :param model: H2OModel
| :param frame: H2OFrame
| :param column: string containing column name
| :param target: (only for multinomial classification) for what target should the plot be done
| :param max_levels: maximum number of factor levels to show
| :param figsize: figure size; passed directly to matplotlib
| :param colormap: colormap name
| :returns: a matplotlib figure object
|
| :examples:
| >>> import h2o
| >>> from h2o.estimators import H2OGradientBoostingEstimator
| >>>
| >>> h2o.init()
| >>>
| >>> # Import the wine dataset into H2O:
| >>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
| >>> df = h2o.import_file(f)
| >>>
| >>> # Set the response
| >>> response = "quality"
| >>>
| >>> # Split the dataset into a train and test set:
| >>> train, test = df.split_frame([0.8])
| >>>
| >>> # Train a GBM
| >>> gbm = H2OGradientBoostingEstimator()
| >>> gbm.train(y=response, training_frame=train)
| >>>
| >>> # Create the individual conditional expectations plot
| >>> gbm.ice_plot(test, column="alcohol")
|
| is_cross_validated(self)
| Return True if the model was cross-validated.
|
| learning_curve_plot(model, metric='AUTO', cv_ribbon=None, cv_lines=None, figsize=(16, 9), colormap=None)
| Learning curve
|
| Create learning curve plot for an H2O Model. Learning curves show error metric dependence on
| learning progress, e.g., RMSE vs number of trees trained so far in GBM. There can be up to 4 curves
| showing Training, Validation, Training on CV Models, and Cross-validation error.
|
| :param model: an H2O model
| :param metric: a stopping metric
| :param cv_ribbon: if True, plot the CV mean as a line and the CV standard deviation as a ribbon around
| the mean; if None, automatically determine whether this is a suitable visualisation
| :param cv_lines: if True, plot the scoring history for individual CV models; if None, automatically
| determine whether this is a suitable visualisation
| :param figsize: figure size; passed directly to matplotlib
| :param colormap: colormap to use
| :return: a matplotlib figure
|
| :examples:
| >>> import h2o
| >>> from h2o.estimators import H2OGradientBoostingEstimator
| >>>
| >>> h2o.init()
| >>>
| >>> # Import the wine dataset into H2O:
| >>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
| >>> df = h2o.import_file(f)
| >>>
| >>> # Set the response
| >>> response = "quality"
| >>>
| >>> # Split the dataset into a train and test set:
| >>> train, test = df.split_frame([0.8])
| >>>
| >>> # Train a GBM
| >>> gbm = H2OGradientBoostingEstimator()
| >>> gbm.train(y=response, training_frame=train)
| >>>
| >>> # Create the learning curve plot
| >>> gbm.learning_curve_plot()
|
| logloss(self, train=False, valid=False, xval=False)
| Get the Log Loss.
|
| If all are False (default), then return the training metric value.
| If more than one option is set to True, then return a dictionary of metrics where the keys are "train",
| "valid", and "xval".
|
| :param bool train: If train is True, then return the log loss value for the training data.
| :param bool valid: If valid is True, then return the log loss value for the validation data.
| :param bool xval: If xval is True, then return the log loss value for the cross validation data.
|
| :returns: The log loss for this classification model.
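|
| The metric follows the standard binary log loss formula; a plain-Python sketch on made-up values (not the H2O code):

```python
import math

# Binary log loss: -(1/N) * sum(y*log(p) + (1-y)*log(1-p)).
def binary_logloss(y_true, p_pred, eps=1e-15):
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# More confident correct predictions give a strictly smaller loss.
assert binary_logloss([1, 0], [0.9, 0.1]) < binary_logloss([1, 0], [0.6, 0.4])
```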
|
| mae(self, train=False, valid=False, xval=False)
| Get the Mean Absolute Error.
|
| If all are False (default), then return the training metric value.
| If more than one option is set to True, then return a dictionary of metrics where the keys are "train",
| "valid", and "xval".
|
| :param bool train: If train is True, then return the MAE value for the training data.
| :param bool valid: If valid is True, then return the MAE value for the validation data.
| :param bool xval: If xval is True, then return the MAE value for the cross validation data.
|
| :returns: The MAE for this regression model.
|
| mean_residual_deviance(self, train=False, valid=False, xval=False)
| Get the Mean Residual Deviances.
|
| If all are False (default), then return the training metric value.
| If more than one option is set to True, then return a dictionary of metrics where the keys are "train",
| "valid", and "xval".
|
| :param bool train: If train is True, then return the Mean Residual Deviance value for the training data.
| :param bool valid: If valid is True, then return the Mean Residual Deviance value for the validation data.
| :param bool xval: If xval is True, then return the Mean Residual Deviance value for the cross validation data.
|
| :returns: The Mean Residual Deviance for this regression model.
|
| model_performance(self, test_data=None, train=False, valid=False, xval=False, auc_type='none')
| Generate model metrics for this model on test_data.
|
| :param H2OFrame test_data: Data set for which model metrics shall be computed against. All three of train,
| valid and xval arguments are ignored if test_data is not None.
| :param bool train: Report the training metrics for the model.
| :param bool valid: Report the validation metrics for the model.
| :param bool xval: Report the cross-validation metrics for the model. If train and valid are True, then it
| defaults to True.
| :param String auc_type: Change the default AUC type for multinomial classification AUC/AUCPR calculation when test_data is not None. One of: ``"auto"``, ``"none"``, ``"macro_ovr"``, ``"weighted_ovr"``, ``"macro_ovo"``, ``"weighted_ovo"`` (default: ``"none"``). If the type is "auto" or "none", AUC and AUCPR are not calculated.
|
| :returns: An object of class H2OModelMetrics.
|
| mse(self, train=False, valid=False, xval=False)
| Get the Mean Square Error.
|
| If all are False (default), then return the training metric value.
| If more than one option is set to True, then return a dictionary of metrics where the keys are "train",
| "valid", and "xval".
|
| :param bool train: If train is True, then return the MSE value for the training data.
| :param bool valid: If valid is True, then return the MSE value for the validation data.
| :param bool xval: If xval is True, then return the MSE value for the cross validation data.
|
| :returns: The MSE for this regression model.
|
| normmul(self)
| Normalization/Standardization multipliers for numeric predictors.
|
| normsub(self)
| Normalization/Standardization offsets for numeric predictors.
|
| ntrees_actual(self)
| Returns the actual number of trees in a tree model. If early stopping is enabled, GBM can reset the
| ntrees value; in this case, the actual ntrees value is less than the original ntrees value the user set
| before building the model.
|
| Type: ``float``
|
| null_degrees_of_freedom(self, train=False, valid=False, xval=False)
| Retrieve the null degrees of freedom if this model has the attribute, or None otherwise.
|
| :param bool train: Get the null dof for the training set. If both train and valid are False, then train is
| selected by default.
| :param bool valid: Get the null dof for the validation set. If both train and valid are True, then train is
| selected by default.
|
| :returns: Return the null dof, or None if it is not present.
|
| null_deviance(self, train=False, valid=False, xval=False)
| Retrieve the null deviance if this model has the attribute, or None otherwise.
|
| :param bool train: Get the null deviance for the training set. If both train and valid are False, then train
| is selected by default.
| :param bool valid: Get the null deviance for the validation set. If both train and valid are True, then train
| is selected by default.
|
| :returns: Return the null deviance, or None if it is not present.
|
| partial_plot(self, data, cols=None, destination_key=None, nbins=20, weight_column=None, plot=True, plot_stddev=True, figsize=(7, 10), server=False, include_na=False, user_splits=None, col_pairs_2dpdp=None, save_to_file=None, row_index=None, targets=None)
| Create partial dependence plot which gives a graphical depiction of the marginal effect of a variable on the
| response. The effect of a variable is measured in change in the mean response.
|
| :param H2OFrame data: An H2OFrame object used for scoring and constructing the plot.
| :param cols: Feature(s) for which partial dependence will be calculated.
| :param destination_key: A key reference to the created partial dependence tables in H2O.
| :param nbins: Number of bins used. For categorical columns, make sure the number of bins exceeds the level count. If you enable add_missing_NA, the returned length will be nbins+1.
| :param weight_column: A string denoting which column of data should be used as the weight column.
| :param plot: A boolean specifying whether to plot partial dependence table.
| :param plot_stddev: A boolean specifying whether to add std err to partial dependence plot.
| :param figsize: Dimension/size of the returning plots, adjust to fit your output cells.
| :param server: Specify whether to activate matplotlib "server" mode. In this case, the plots are saved to a file instead of being rendered.
| :param include_na: A boolean specifying whether missing value should be included in the Feature values.
| :param user_splits: a dictionary containing column names as key and user defined split values as value in a list.
| :param col_pairs_2dpdp: list containing pairs of column names for 2D pdp
| :param save_to_file: Fully qualified name of an image file the resulting plot should be saved to, e.g. '/home/user/pdpplot.png'. The 'png' suffix might be omitted. If the file already exists, it will be overwritten. The plot is only saved if plot = True.
| :param row_index: Row for which partial dependence will be calculated instead of the whole input frame.
| :param targets: Target classes for multiclass model.
| :returns: Plot and list of calculated mean response tables for each feature requested.
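|
| The computation behind a 1-D partial dependence table can be sketched in plain Python (a conceptual illustration with a made-up model and data, not the H2O implementation): for each grid value of the chosen feature, force that feature to the value in every row and average the model's predictions.

```python
# Conceptual sketch of 1-D partial dependence.
def partial_dependence(predict, rows, feature_idx, grid):
    table = []
    for value in grid:
        preds = []
        for row in rows:
            modified = list(row)
            modified[feature_idx] = value  # force the feature to the grid value
            preds.append(predict(modified))
        table.append((value, sum(preds) / len(preds)))
    return table

# Toy model: response is 2*feature0 + feature1.
model = lambda row: 2.0 * row[0] + row[1]
rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]

table = partial_dependence(model, rows, feature_idx=0, grid=[0.0, 1.0, 2.0])
# The mean response rises by 2.0 per unit of feature 0, matching the model.
assert table[1][1] - table[0][1] == 2.0
```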
|
| pd_plot(model, frame, column, row_index=None, target=None, max_levels=30, figsize=(16, 9), colormap='Dark2')
| Plot partial dependence plot.
|
| Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable
| on the response. The effect of a variable is measured in change in the mean response.
| A PDP assumes independence between the feature for which the PDP is computed and the rest.
|
| :param model: H2O Model object
| :param frame: H2OFrame
| :param column: string containing column name
| :param row_index: if None, do partial dependence, if integer, do individual
| conditional expectation for the row specified by this integer
| :param target: (only for multinomial classification) for what target should the plot be done
| :param max_levels: maximum number of factor levels to show
| :param figsize: figure size; passed directly to matplotlib
| :param colormap: colormap name; used to get just the first color to keep the api and color scheme similar with
| pd_multi_plot
| :returns: a matplotlib figure object
|
| :examples:
| >>> import h2o
| >>> from h2o.estimators import H2OGradientBoostingEstimator
| >>>
| >>> h2o.init()
| >>>
| >>> # Import the wine dataset into H2O:
| >>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
| >>> df = h2o.import_file(f)
| >>>
| >>> # Set the response
| >>> response = "quality"
| >>>
| >>> # Split the dataset into a train and test set:
| >>> train, test = df.split_frame([0.8])
| >>>
| >>> # Train a GBM
| >>> gbm = H2OGradientBoostingEstimator()
| >>> gbm.train(y=response, training_frame=train)
| >>>
| >>> # Create partial dependence plot
| >>> gbm.pd_plot(test, column="alcohol")
|
| permutation_importance(self, frame, metric='AUTO', n_samples=10000, n_repeats=1, features=None, seed=-1, use_pandas=False)
| Get Permutation Variable Importance.
|
| When n_repeats == 1, the result is similar to the one from the varimp() method, i.e., it contains
| the columns "Relative Importance", "Scaled Importance", and "Percentage".
|
| When n_repeats > 1, the individual columns correspond to the permutation variable importance
| values from the individual runs; each value corresponds to the "Relative Importance", i.e., the
| distance between the original prediction error and the prediction error measured on a frame with
| the given feature permuted.
|
| :param frame: training frame
| :param metric: metric to be used. One of "AUTO", "AUC", "MAE", "MSE", "RMSE", "logloss", "mean_per_class_error",
| "PR_AUC". Defaults to "AUTO".
| :param n_samples: number of samples to be evaluated. Use -1 to use the whole dataset. Defaults to 10 000.
| :param n_repeats: number of repeated evaluations. Defaults to 1.
| :param features: features to include in the permutation importance. Use None to include all.
| :param seed: seed for the random generator. Use -1 to pick a random seed. Defaults to -1.
| :param use_pandas: set true to return pandas data frame.
| :return: H2OTwoDimTable or Pandas data frame
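The idea behind permutation importance can be sketched in plain Python, independent of H2O. A minimal sketch with an invented toy "model" and made-up feature names ("x1", "noise"): shuffling an informative column should increase the prediction error more than shuffling an ignored one.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def perm_importance(predict, X, y, col, n_repeats=5, seed=42):
    """Average increase in MSE after shuffling one column of X (list of dicts)."""
    rng = random.Random(seed)
    baseline = mse(y, [predict(row) for row in X])
    deltas = []
    for _ in range(n_repeats):
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        X_perm = [dict(row, **{col: v}) for row, v in zip(X, shuffled)]
        deltas.append(mse(y, [predict(row) for row in X_perm]) - baseline)
    return sum(deltas) / len(deltas)

# Toy "model": the response depends only on "x1", never on "noise".
X = [{"x1": i, "noise": i % 3} for i in range(50)]
y = [2 * row["x1"] for row in X]
model = lambda row: 2 * row["x1"]

imp_x1 = perm_importance(model, X, y, "x1")
imp_noise = perm_importance(model, X, y, "noise")
assert imp_x1 > imp_noise  # shuffling the informative feature hurts more
```

This mirrors the "distance between the original prediction error and the prediction error with a feature permuted" description above; the actual H2O implementation additionally supports sampling, metrics other than MSE, and repeated runs reported per column.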
|
| permutation_importance_plot(self, frame, metric='AUTO', n_samples=10000, n_repeats=1, features=None, seed=-1, num_of_features=10, server=False)
| Plot Permutation Variable Importance. This method plots a bar plot or, if n_repeats > 1, a box
| plot, and returns the variable importance table.
|
| :param frame: training frame
| :param metric: metric to be used. One of "AUTO", "AUC", "MAE", "MSE", "RMSE", "logloss", "mean_per_class_error",
| "PR_AUC". Defaults to "AUTO".
| :param n_samples: number of samples to be evaluated. Use -1 to use the whole dataset. Defaults to 10 000.
| :param n_repeats: number of repeated evaluations. Defaults to 1.
| :param features: features to include in the permutation importance. Use None to include all.
| :param seed: seed for the random generator. Use -1 to pick a random seed. Defaults to -1.
| :param num_of_features: number of features to plot. Defaults to 10.
| :param server: if true set server settings to matplotlib and do not show the plot
| :return: H2OTwoDimTable with variable importance
|
| pprint_coef(self)
| Pretty print the coefficients table (includes normalized coefficients).
|
| pr_auc(self, train=False, valid=False, xval=False)
| ``ModelBase.pr_auc`` is deprecated, please use ``ModelBase.aucpr`` instead.
|
| predict(self, test_data, custom_metric=None, custom_metric_func=None)
| Predict on a dataset.
|
| :param H2OFrame test_data: Data on which to make predictions.
| :param custom_metric: custom evaluation function defined as a class reference; the class gets
| uploaded to the cluster
| :param custom_metric_func: custom evaluation function reference, e.g., the result of upload_custom_metric
|
| :returns: A new H2OFrame of predictions.
|
| predict_contributions(self, test_data, output_format='Original', top_n=None, bottom_n=None, compare_abs=False)
| Predict feature contributions - SHAP values on an H2O Model (only GBM, XGBoost, DRF models and equivalent
| imported MOJOs).
|
| Returned H2OFrame has shape (#rows, #features + 1); there is a feature contribution column for each input
| feature, and the last column is the model bias (the same value for each row). The sum of the feature
| contributions and the bias term is equal to the raw prediction of the model. The raw prediction of a
| tree-based model is the sum of the predictions of the individual trees before the inverse link function
| is applied to get the actual prediction. For the Gaussian distribution, the sum of the contributions is
| equal to the model prediction.
|
| Note: Multinomial classification models are currently not supported.
|
| :param H2OFrame test_data: Data on which to calculate contributions.
| :param Enum output_format: Specify how to output feature contributions in XGBoost - XGBoost by default outputs
| contributions for 1-hot encoded features, specifying a Compact output format will produce a per-feature
| contribution. One of: ``"Original"``, ``"Compact"`` (default: ``"Original"``).
| :param top_n: Return only the #top_n highest contributions + bias.
| If top_n < 0, sort all SHAP values in descending order.
| :param bottom_n: Return only the #bottom_n lowest contributions + bias.
| If top_n and bottom_n are defined together, return an array of #top_n + #bottom_n values + bias.
| If bottom_n < 0, sort all SHAP values in ascending order.
| If top_n < 0 and bottom_n < 0, sort all SHAP values in descending order.
| :param compare_abs: True to compare absolute values of contributions
| :returns: A new H2OFrame made of feature contributions.
|
| :examples:
| >>> prostate = "http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv"
| >>> fr = h2o.import_file(prostate)
| >>> predictors = list(range(2, fr.ncol))
| >>> m = H2OGradientBoostingEstimator(ntrees=10, seed=1234)
| >>> m.train(x=predictors, y=1, training_frame=fr)
| >>> # Compute SHAP
| >>> m.predict_contributions(fr)
| >>> # Compute SHAP and pick the top two highest
| >>> m.predict_contributions(fr, top_n=2)
| >>> # Compute SHAP and pick the top two lowest
| >>> m.predict_contributions(fr, bottom_n=2)
| >>> # Compute SHAP and pick the top two highest regardless of the sign
| >>> m.predict_contributions(fr, top_n=2, compare_abs=True)
| >>> # Compute SHAP and pick top two lowest regardless of the sign
| >>> m.predict_contributions(fr, bottom_n=2, compare_abs=True)
| >>> # Compute SHAP values and show them all in descending order
| >>> m.predict_contributions(fr, top_n=-1)
| >>> # Compute SHAP and pick the top two highest and top two lowest
| >>> m.predict_contributions(fr, top_n=2, bottom_n=2)
|
| predict_leaf_node_assignment(self, test_data, type='Path')
| Predict on a dataset and return the leaf node assignment (only for tree-based models).
|
| :param H2OFrame test_data: Data on which to make predictions.
| :param Enum type: How to identify the leaf node. Nodes can be identified either by a path from the root
| node of the tree to the node, or by H2O's internal node id. One of: ``"Path"``, ``"Node_ID"`` (default: ``"Path"``).
|
| :returns: A new H2OFrame of predictions.
|
| r2(self, train=False, valid=False, xval=False)
| Return the R squared for this regression model.
|
| Will return R^2 for GLM Models and will return NaN otherwise.
|
| The R^2 value is defined to be 1 - MSE/var, where var is computed as sigma*sigma.
|
| If all are False (default), then return the training metric value.
| If more than one options is set to True, then return a dictionary of metrics where the keys are "train",
| "valid", and "xval".
|
| :param bool train: If train is True, then return the R^2 value for the training data.
| :param bool valid: If valid is True, then return the R^2 value for the validation data.
| :param bool xval: If xval is True, then return the R^2 value for the cross validation data.
|
| :returns: The R squared for this regression model.
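The formula above (R^2 = 1 - MSE/var, with var = sigma*sigma) can be checked numerically. A minimal sketch in plain Python with made-up response and prediction values:

```python
# R^2 = 1 - MSE / var, where var is the (population) variance of the response.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

n = len(y_true)
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
mean = sum(y_true) / n
var = sum((t - mean) ** 2 for t in y_true) / n  # sigma * sigma
r2 = 1 - mse / var

print(round(r2, 4))  # → 0.9486
```

A perfect model gives MSE = 0 and thus R^2 = 1; a model no better than predicting the mean gives MSE = var and thus R^2 = 0.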
|
| residual_degrees_of_freedom(self, train=False, valid=False, xval=False)
| Retrieve the residual degrees of freedom if this model has the attribute, or None otherwise.
|
| :param bool train: Get the residual dof for the training set. If both train and valid are False, then train
| is selected by default.
| :param bool valid: Get the residual dof for the validation set. If both train and valid are True, then train
| is selected by default.
|
| :returns: Return the residual dof, or None if it is not present.
|
| residual_deviance(self, train=False, valid=False, xval=None)
| Retrieve the residual deviance if this model has the attribute, or None otherwise.
|
| :param bool train: Get the residual deviance for the training set. If both train and valid are False, then
| train is selected by default.
| :param bool valid: Get the residual deviance for the validation set. If both train and valid are True, then
| train is selected by default.
|
| :returns: Return the residual deviance, or None if it is not present.
|
| respmul(self)
| Normalization/Standardization multipliers for numeric response.
|
| respsub(self)
| Normalization/Standardization offsets for numeric response.
|
| rmse(self, train=False, valid=False, xval=False)
| Get the Root Mean Square Error.
|
| If all are False (default), then return the training metric value.
| If more than one options is set to True, then return a dictionary of metrics where the keys are "train",
| "valid", and "xval".
|
| :param bool train: If train is True, then return the RMSE value for the training data.
| :param bool valid: If valid is True, then return the RMSE value for the validation data.
| :param bool xval: If xval is True, then return the RMSE value for the cross validation data.
|
| :returns: The RMSE for this regression model.
|
| rmsle(self, train=False, valid=False, xval=False)
| Get the Root Mean Squared Logarithmic Error.
|
| If all are False (default), then return the training metric value.
| If more than one options is set to True, then return a dictionary of metrics where the keys are "train",
| "valid", and "xval".
|
| :param bool train: If train is True, then return the RMSLE value for the training data.
| :param bool valid: If valid is True, then return the RMSLE value for the validation data.
| :param bool xval: If xval is True, then return the RMSLE value for the cross validation data.
|
| :returns: The RMSLE for this regression model.
|
| rotation(self)
| Obtain the rotations (eigenvectors) for a PCA model
|
| :return: H2OFrame
|
| save_model_details(self, path='', force=False, filename=None)
| Save Model Details of an H2O Model in JSON Format to disk.
|
| :param path: a path to save the model details at (hdfs, s3, local)
| :param force: if True, overwrite the destination directory if it exists; if False, throw an exception instead.
| :param filename: a filename for the saved model (file type is always .json)
|
| :returns str: the path of the saved model details
|
| save_mojo(self, path='', force=False, filename=None)
| Save an H2O Model as MOJO (Model Object, Optimized) to disk.
|
| :param path: a path to save the model at (hdfs, s3, local)
| :param force: if True, overwrite the destination directory if it exists; if False, throw an exception instead.
| :param filename: a filename for the saved model (file type is always .zip)
|
| :returns str: the path of the saved model
|
| score_history(self)
| DEPRECATED. Use :meth:`scoring_history` instead.
|
| scoring_history(self)
| Retrieve Model Score History.
|
| :returns: The score history as an H2OTwoDimTable or a Pandas DataFrame.
|
| shap_explain_row_plot(model, frame, row_index, columns=None, top_n_features=10, figsize=(16, 9), plot_type='barplot', contribution_type='both')
| SHAP local explanation
|
| SHAP explanation shows the contribution of features for a given instance. The sum
| of the feature contributions and the bias term is equal to the raw prediction
| of the model, i.e., the prediction before applying the inverse link function. H2O implements
| TreeSHAP, which, when features are correlated, can increase the contribution of a feature
| that had no influence on the prediction.
|
| :param model: h2o tree model, such as DRF, XRT, GBM, XGBoost
| :param frame: H2OFrame
| :param row_index: row index of the instance to inspect
| :param columns: either a list of columns or column indices to show. If specified
| parameter top_n_features will be ignored.
| :param top_n_features: a number of columns to pick using variable importance (where applicable).
| When plot_type="barplot", then top_n_features will be chosen for each contribution_type.
| :param figsize: figure size; passed directly to matplotlib
| :param plot_type: either "barplot" or "breakdown"
| :param contribution_type: One of "positive", "negative", or "both".
| Used only for plot_type="barplot".
| :returns: a matplotlib figure object
|
| :examples:
| >>> import h2o
| >>> from h2o.estimators import H2OGradientBoostingEstimator
| >>>
| >>> h2o.init()
| >>>
| >>> # Import the wine dataset into H2O:
| >>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
| >>> df = h2o.import_file(f)
| >>>
| >>> # Set the response
| >>> response = "quality"
| >>>
| >>> # Split the dataset into a train and test set:
| >>> train, test = df.split_frame([0.8])
| >>>
| >>> # Train a GBM
| >>> gbm = H2OGradientBoostingEstimator()
| >>> gbm.train(y=response, training_frame=train)
| >>>
| >>> # Create SHAP row explanation plot
| >>> gbm.shap_explain_row_plot(test, row_index=0)
|
| shap_summary_plot(model, frame, columns=None, top_n_features=20, samples=1000, colorize_factors=True, alpha=1, colormap=None, figsize=(12, 12), jitter=0.35)
| SHAP summary plot
|
| SHAP summary plot shows the contribution of features for each instance. The sum
| of the feature contributions and the bias term is equal to the raw prediction
| of the model, i.e., the prediction before applying the inverse link function.
|
| :param model: h2o tree model, such as DRF, XRT, GBM, XGBoost
| :param frame: H2OFrame
| :param columns: either a list of columns or column indices to show. If specified
| parameter top_n_features will be ignored.
| :param top_n_features: a number of columns to pick using variable importance (where applicable).
| :param samples: maximum number of observations to use; if lower than number of rows in the
| frame, take a random sample
| :param colorize_factors: if True, use colors from the colormap to colorize the factors;
| otherwise all levels will have same color
| :param alpha: transparency of the points
| :param colormap: colormap to use instead of the default blue to red colormap
| :param figsize: figure size; passed directly to matplotlib
| :param jitter: amount of jitter used to show the point density
| :returns: a matplotlib figure object
|
| :examples:
| >>> import h2o
| >>> from h2o.estimators import H2OGradientBoostingEstimator
| >>>
| >>> h2o.init()
| >>>
| >>> # Import the wine dataset into H2O:
| >>> f = "https://h2o-public-test-data.s3.amazonaws.com/smalldata/wine/winequality-redwhite-no-BOM.csv"
| >>> df = h2o.import_file(f)
| >>>
| >>> # Set the response
| >>> response = "quality"
| >>>
| >>> # Split the dataset into a train and test set:
| >>> train, test = df.split_frame([0.8])
| >>>
| >>> # Train a GBM
| >>> gbm = H2OGradientBoostingEstimator()
| >>> gbm.train(y=response, training_frame=train)
| >>>
| >>> # Create SHAP summary plot
| >>> gbm.shap_summary_plot(test)
|
| show(self)
| Print the innards of the model, without regard to type.
|
| staged_predict_proba(self, test_data)
| Predict class probabilities at each stage of an H2O Model (only GBM models).
|
| The output structure is analogous to the output of the predict_leaf_node_assignment function. For each
| tree t and class c there will be a column Tt.Cc (e.g., T3.C1 for tree 3 and class 1). The value will be
| the predicted probability of this class obtained by combining the raw contributions of trees T1.Cc, ..., Tt.Cc.
| Binomial models build the trees just for the first class, so values in columns Tx.C1 correspond to the probability p0.
|
| :param H2OFrame test_data: Data on which to make predictions.
|
| :returns: A new H2OFrame of staged predictions.
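How raw tree contributions combine into a staged probability can be sketched in plain Python. The per-tree scores below are invented, and a logistic inverse link is assumed for illustration (the actual link and initial score depend on the model's distribution):

```python
import math

def sigmoid(z):
    """Logistic inverse link (assumed here for a binomial model)."""
    return 1.0 / (1.0 + math.exp(-z))

# Invented raw contributions of trees T1..T4 for class C1 of a single row.
tree_scores = [0.3, -0.1, 0.2, 0.05]

# Staged probability after tree t: inverse link of the cumulative raw score,
# i.e., the value that would appear in column "Tt.C1" for this row.
cumulative = 0.0
staged = []
for score in tree_scores:
    cumulative += score
    staged.append(sigmoid(cumulative))

print([round(p, 3) for p in staged])  # → [0.574, 0.55, 0.599, 0.611]
```

Each staged value shows how the prediction evolves as more trees are included; the final entry matches the probability from the full ensemble under this assumed link.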
|
| std_coef_plot(self, num_of_features=None, server=False)
| Plot a model's standardized coefficient magnitudes.
|
| :param num_of_features: the number of features shown in the plot.
| :param server: if true set server settings to matplotlib and do not show the graph
|
| :returns: None.
|
| summary(self)
| Print a detailed summary of the model.
|
| training_model_metrics(self)
| Return training model metrics for any model.
|
| update_tree_weights(self, frame, weights_column)
| Re-calculates tree-node weights based on provided dataset. Modifying node weights will affect how
| contribution predictions (Shapley values) are calculated. This can be used to explain the model
| on a curated sub-population of the training dataset.
|
| :param frame: frame that will be used to re-populate trees with new observations and to collect per-node weights
| :param weights_column: name of the weight column (can be different from training weights)
|
| varimp(self, use_pandas=False)
| Pretty print the variable importances, or return them in a list.
|
| :param bool use_pandas: If True, then the variable importances will be returned as a pandas data frame.
|
| :returns: A list or Pandas DataFrame.
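The three varimp columns relate to each other in a simple way. A sketch in plain Python, with made-up raw importance values:

```python
# "Relative Importance" is model-specific; the other two columns are derived
# from it: scaled = relative / max(relative), percentage = relative / sum(relative).
relative = [120.0, 60.0, 20.0]

scaled = [r / max(relative) for r in relative]      # "Scaled Importance"
percentage = [r / sum(relative) for r in relative]  # "Percentage"

print(scaled)      # → [1.0, 0.5, 0.16666666666666666]
print(percentage)  # → [0.6, 0.3, 0.1]
```

The most important feature therefore always has a scaled importance of 1.0, and the percentage column sums to 1.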
|
| varimp_plot(self, num_of_features=None, server=False)
| Plot the variable importance for a trained model.
|
| :param num_of_features: the number of features shown in the plot (default is 10 or all if less than 10).
| :param server: if true set server settings to matplotlib and do not show the graph
|
| :returns: None.
|
| weights(self, matrix_id=0)
| Return the frame for the respective weight matrix.
|
| :param matrix_id: an integer, ranging from 0 to number of layers, that specifies the weight matrix to return.
|
| :returns: an H2OFrame which represents the weight matrix identified by matrix_id
|
| xval_keys(self)
| Return model keys for the cross-validated model.
|
| ----------------------------------------------------------------------
| Readonly properties inherited from h2o.model.model_base.ModelBase:
|
| actual_params
| Dictionary of actual parameters of the model.
|
| default_params
| Dictionary of the default parameters of the model.
|
| end_time
| Timestamp (milliseconds since 1970) when the model training was ended.
|
| full_parameters
| Dictionary of the full specification of all parameters.
|
| have_mojo
| True, if export to MOJO is possible
|
| have_pojo
| True, if export to POJO is possible
|
| key
| :return: the unique key representing the object on the backend
|
| params
| Get the parameters and the actual/default values only.
|
| :returns: A dictionary of parameters used to build this model.
|
| run_time
| Model training time in milliseconds
|
| start_time
| Timestamp (milliseconds since 1970) when the model training was started.
|
| type
| The type of model built: ``"classifier"`` or ``"regressor"`` or ``"unsupervised"``
|
| xvals
| Return a list of the cross-validated models.
|
| :returns: A list of models.
|
| ----------------------------------------------------------------------
| Data descriptors inherited from h2o.model.model_base.ModelBase:
|
| model_id
| Model identifier.
|
| ----------------------------------------------------------------------
| Data descriptors inherited from h2o.base.Keyed:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
Help on function import_file in module h2o.h2o:
import_file(path=None, destination_frame=None, parse=True, header=0, sep=None, col_names=None, col_types=None, na_strings=None, pattern=None, skipped_columns=None, custom_non_data_line_markers=None, partition_by=None, quotechar=None, escapechar=None)
Import a dataset that is already on the cluster.
The path to the data must be a valid path for each node in the H2O cluster. If some node in the H2O cluster
cannot see the file, then an exception will be thrown by the H2O cluster. Does a parallel/distributed
multi-threaded pull of the data. The main difference between this method and :func:`upload_file` is that
the latter works with local files, whereas this method imports remote files (i.e. files local to the server).
If you are running the H2O server on your own machine, then both methods behave the same.
:param path: path(s) specifying the location of the data to import or a path to a directory of files to import
:param destination_frame: The unique hex key assigned to the imported file. If none is given, a key will be
automatically generated.
:param parse: If True, the file should be parsed after import. If False, then a list is returned containing the file path.
:param header: -1 means the first line is data, 0 means guess, 1 means first line is header.
:param sep: The field separator character. Values on each line of the file are separated by
this character. If not provided, the parser will automatically detect the separator.
:param col_names: A list of column names for the file.
:param col_types: A list of types or a dictionary of column names to types to specify whether columns
should be forced to a certain type upon import parsing. If a list, the types for elements that are
None will be guessed. The possible types a column may have are:
- "unknown" - this will force the column to be parsed as all NA
- "uuid" - the values in the column must be true UUIDs or will be parsed as NA
- "string" - force the column to be parsed as a string
- "numeric" - force the column to be parsed as numeric. H2O will handle the compression of the numeric
data in the optimal manner.
- "enum" - force the column to be parsed as a categorical column.
- "time" - force the column to be parsed as a time column. H2O will attempt to parse the following
list of date time formats: (date) "yyyy-MM-dd", "yyyy MM dd", "dd-MMM-yy", "dd MMM yy", (time)
"HH:mm:ss", "HH:mm:ss:SSS", "HH:mm:ss:SSSnnnnnn", "HH.mm.ss", "HH.mm.ss.SSS", "HH.mm.ss.SSSnnnnnn".
Times can also contain "AM" or "PM".
:param partition_by: Names of the columns the persisted dataset has been partitioned by.
:param na_strings: A list of strings, or a list of lists of strings (one list per column), or a dictionary
of column names to strings which are to be interpreted as missing values.
:param pattern: Character string containing a regular expression to match file(s) in the folder if `path` is a
directory.
:param skipped_columns: a list of integer column indices to skip; these columns are not parsed into the final frame.
:param custom_non_data_line_markers: If a line in the imported file starts with any character in the given string, it will NOT be imported. An empty string means all lines are imported; None means the default behaviour for the given format is used.
:param quotechar: A hint for the parser which character to expect as quoting character. Only single quote, double quote or None (default) are allowed. None means automatic detection.
:param escapechar: (Optional) One ASCII character used to escape other characters.
:returns: a new :class:`H2OFrame` instance.
:examples:
>>> birds = h2o.import_file("https://s3.amazonaws.com/h2o-public-test-data/smalldata/pca_test/birds.csv")